Introduction

This report focuses on exploring data using Python visualization techniques.

Dataset

Kaggle European Soccer Database, is in RDBMS format (SQLite DB) and contains European Professional Football records for 25k+ soccer matches, 11 lead championships in Europe and 79 players & teams attibutes between 2008 and 2016.

I used Pandas to retrieve the data from SQLite tables and store in corresponding dataframes. I, then, used Python analytics and visualization modules for further processing.

  • Analytics modules explored in this modules include numpy, scipy, nltk, and pandas.
  • Data visualization techniques involved include matplotlib, seaborn, mpl_toolkits, plotly, folium, networkx, PIL, and wordcloud.

Instead of SQL join, I used Pandas Merge to combine data in dataframes and conduct analyses.

Questions

In Exploratory phase I addressed quite a few questions. For example,

  • Does Player's "Overall Rating" depend on any Player Attribute(s)?
  • What are the similarities and differences in terms of skills among Top Overall_Rating Players?
  • Create a Word Cloud for players with Overall Rating beyond 88
  • Does the proportion of preferred foot change between all players and top players with overall rating higher than 80?, etc.
In [1]:
## SQLite related
import sqlite3

## Data Analytics
import numpy as np
from scipy.stats import mode
import pandas as pd
from math import pi
import nltk

## Data Visualization
import matplotlib.pyplot as plt
from matplotlib import animation,rc
import seaborn as sns
from mpl_toolkits.basemap import Basemap
from mpl_toolkits.mplot3d import Axes3D
import plotly.offline as py
import plotly.graph_objs as go
import plotly.tools as tls
import folium
import folium.plugins
import networkx as nx
from PIL import Image
from wordcloud import WordCloud

## Systems
import io
from datetime import timedelta, date

import warnings
import base64
import itertools
from subprocess import check_output
import pprint as pp

%matplotlib inline
In [2]:
## Define Constants used in this notebook
ThisYear = date.today().timetuple()[0]

## Environment Setup:
pd.set_option('display.max_columns', 500)
pd.set_option('display.max_rows', 500)
py.init_notebook_mode(connected=True)
warnings.filterwarnings("ignore")

Connect to SQLite Datebase

Load data from SQLite Tables to Pandas DataFrames.

  • sqlite3.connect() - Connect to SQLite database, named conn;
  • conn.cursor() - Create a Cursor;
  • cursor.execute() - Execute SQL queries to retrieve data;

    ### Retrieve Table Names in SQLite database

In [3]:
## Create a connection instance and cursor:
conn = sqlite3.connect("input/databaseSoccer.sqlite")
cursor = conn.cursor()

## Retrieve Table Names in a list
tblNames = cursor.execute("SELECT name FROM sqlite_master WHERE type='table';")
tblNames = [ tblName[0].strip(",") for tblName in cursor.fetchall()]

## Verify which Tables have been collected: 
tblNames
Out[3]:
['sqlite_sequence',
 'Player_Attributes',
 'Player',
 'Match',
 'League',
 'Country',
 'Team',
 'Team_Attributes']
In [4]:
## Use Pandas to load data from SQLite tables to Pandas DataFrames, one for each table, and stored them in a Dict. 
def Table2DF(tblName):
  '''Return a dataframe with the entrie table content '''
  qryTable = "SELECT * FROM " + tblName + ";"
  return pd.read_sql_query(qryTable, conn)

DFs = {}  ## A Dict of DataFrames.

for tbl in tblNames:
  DFs["df" + tbl] =  Table2DF(tbl)

Initial Analyses

Define Functions

Define functions for common operations on initial exploration:

In [5]:
attrs = ['shape', "columns", "info", 'describe']  ## DF attributes to explore

simpAttrs = ['shape', "columns"]                  ## Simple Attributes
callAttrs = ['info', 'mean', 'describe', "head"]  ## Callable Attributes


## Categorical Attributes:
def descCatAttrs(df):
  '''Describe Categorical Attributes of a DataFrame'''
  return df.select_dtypes(exclude=["float64", "int64"]).describe()


## Uniqueness
def cntUniqValuesInCatAttrs(df):
  '''Count Unique Values in each Categorical Attribute'''
  return df.select_dtypes(exclude=['float64','int64']).apply(lambda x: len(x.unique().tolist()), axis = 0)

## Deal with NaN, Null:
def showAttrsWMissingValues(df):
  '''Detect Attribute with Missing Values and show a count of NaN on that attribute '''
  return df.isna().sum(axis = 0)

def showEntitiesWMissingValues(df):
  '''Detect Rows (Entities)) with missing values on Attributes (Columns) and show the counts '''
  return df.isna().sum(axis = 1)

def extractEntitiesWMissingValues(df):
  '''Extract Rows (Entities) with missing values on any Attribute to a new DataFrame '''
  return df.loc[ df.isna().any(axis = 1)]

def removeEntitiesWMissingValues(df):
  '''Extract Rows (Entities) with missing values on any Attribute to a new DataFrame '''
  return df.loc[ ~df.isna().any(axis = 1)]


## Deal with Attributes, i.e. Columns:
def getColNames(df):
  '''Return a list of Attribute Names (Columns) in a DataFrame '''
  return [ col for col in df.columns]

def getNumAttrs(df):
  '''Return a DataFrame with Numerical Attributes'''
  return df.select_dtypes(include =['float64','int64'])

def getCatAttrs(df):
  '''Return a DataFrame with Categorical Attributes'''
  return df.select_dtypes(exclude =['float64','int64'])

def getMbrNames(df):
  '''Return a list of Name attribute in this dataframe'''
  return [member for member in df['name']] if 'name' in df else "No name attribute in this DataFrame."

## View basic form factor of a DataFrame:
def viewDF(df, *attrs):
  '''Analyzing DataFrame - show basic attributes of a DataFrame '''
 
  for attr in attrs:
  
    if attr in simpAttrs:
     
      if attr == "columns":
        print(" DataFrame {}: {}".format(attr, [ col for col in getattr(df, attr)] ))
      else:
        print(" DataFrame {}: {}".format(attr, getattr(df, attr)))

    elif attr in callAttrs:
      if attr == "info":
        print("\ninfo():")
        getattr(df, attr)()
      else:
        print("\n{}(): {}".format(attr, getattr(df, attr)() ))
In [6]:
##  Get an Idea about the dataset

for tbl in tblNames:
  '''A list of attributes to get an idea about a dataframe '''

  if tbl == "sqlite_sequence" or "Attributes" in tbl:
    continue

  dfTblName = "df" + tbl
  print('Analyzing "{}" Table >> DataFrame "{}" [Dim:{}]:\n'.format(tbl, dfTblName, DFs[dfTblName].shape))

  ## Basic understanding :
  viewDF(DFs[dfTblName], *attrs)

  ## More details:
  print("\nDescribe Categorical Attributes: {}".format(descCatAttrs(DFs[dfTblName])))
  print("\n{} Attributes with Missing Values: {}".format(tbl, showAttrsWMissingValues(DFs[dfTblName])))
  print("\n{} Members with Missing Values: {}".format(tbl, showEntitiesWMissingValues(DFs[dfTblName])))
  print(80*"*")
Analyzing "Player" Table >> DataFrame "dfPlayer" [Dim:(11060, 7)]:

 DataFrame shape: (11060, 7)
 DataFrame columns: ['id', 'player_api_id', 'player_name', 'player_fifa_api_id', 'birthday', 'height', 'weight']

info():
<class 'pandas.core.frame.DataFrame'>
RangeIndex: 11060 entries, 0 to 11059
Data columns (total 7 columns):
id                    11060 non-null int64
player_api_id         11060 non-null int64
player_name           11060 non-null object
player_fifa_api_id    11060 non-null int64
birthday              11060 non-null object
height                11060 non-null float64
weight                11060 non-null int64
dtypes: float64(1), int64(4), object(2)
memory usage: 604.9+ KB

describe():                  id  player_api_id  player_fifa_api_id        height  \
count  11060.000000   11060.000000        11060.000000  11060.000000   
mean    5537.511392  156582.427215       165664.910488    181.867445   
std     3197.692647  160713.700624        58649.928360      6.369201   
min        1.000000    2625.000000            2.000000    157.480000   
25%     2767.750000   35555.500000       151889.500000    177.800000   
50%     5536.500000   96619.500000       184671.000000    182.880000   
75%     8306.250000  212470.500000       203883.250000    185.420000   
max    11075.000000  750584.000000       234141.000000    208.280000   

             weight  
count  11060.000000  
mean     168.380289  
std       14.990217  
min      117.000000  
25%      159.000000  
50%      168.000000  
75%      179.000000  
max      243.000000  

Describe Categorical Attributes:        player_name             birthday
count        11060                11060
unique       10848                 5762
top         Danilo  1989-03-02 00:00:00
freq             7                   10

Player Attributes with Missing Values: id                    0
player_api_id         0
player_name           0
player_fifa_api_id    0
birthday              0
height                0
weight                0
dtype: int64

Player Members with Missing Values: 0        0
1        0
2        0
3        0
4        0
5        0
6        0
7        0
8        0
9        0
10       0
11       0
12       0
13       0
14       0
15       0
16       0
17       0
18       0
19       0
20       0
21       0
22       0
23       0
24       0
25       0
26       0
27       0
28       0
29       0
30       0
31       0
32       0
33       0
34       0
35       0
36       0
37       0
38       0
39       0
40       0
41       0
42       0
43       0
44       0
45       0
46       0
47       0
48       0
49       0
50       0
51       0
52       0
53       0
54       0
55       0
56       0
57       0
58       0
59       0
60       0
61       0
62       0
63       0
64       0
65       0
66       0
67       0
68       0
69       0
70       0
71       0
72       0
73       0
74       0
75       0
76       0
77       0
78       0
79       0
80       0
81       0
82       0
83       0
84       0
85       0
86       0
87       0
88       0
89       0
90       0
91       0
92       0
93       0
94       0
95       0
96       0
97       0
98       0
99       0
100      0
101      0
102      0
103      0
104      0
105      0
106      0
107      0
108      0
109      0
110      0
111      0
112      0
113      0
114      0
115      0
116      0
117      0
118      0
119      0
120      0
121      0
122      0
123      0
124      0
125      0
126      0
127      0
128      0
129      0
130      0
131      0
132      0
133      0
134      0
135      0
136      0
137      0
138      0
139      0
140      0
141      0
142      0
143      0
144      0
145      0
146      0
147      0
148      0
149      0
150      0
151      0
152      0
153      0
154      0
155      0
156      0
157      0
158      0
159      0
160      0
161      0
162      0
163      0
164      0
165      0
166      0
167      0
168      0
169      0
170      0
171      0
172      0
173      0
174      0
175      0
176      0
177      0
178      0
179      0
180      0
181      0
182      0
183      0
184      0
185      0
186      0
187      0
188      0
189      0
190      0
191      0
192      0
193      0
194      0
195      0
196      0
197      0
198      0
199      0
200      0
201      0
202      0
203      0
204      0
205      0
206      0
207      0
208      0
209      0
210      0
211      0
212      0
213      0
214      0
215      0
216      0
217      0
218      0
219      0
220      0
221      0
222      0
223      0
224      0
225      0
226      0
227      0
228      0
229      0
230      0
231      0
232      0
233      0
234      0
235      0
236      0
237      0
238      0
239      0
240      0
241      0
242      0
243      0
244      0
245      0
246      0
247      0
248      0
249      0
        ..
10810    0
10811    0
10812    0
10813    0
10814    0
10815    0
10816    0
10817    0
10818    0
10819    0
10820    0
10821    0
10822    0
10823    0
10824    0
10825    0
10826    0
10827    0
10828    0
10829    0
10830    0
10831    0
10832    0
10833    0
10834    0
10835    0
10836    0
10837    0
10838    0
10839    0
10840    0
10841    0
10842    0
10843    0
10844    0
10845    0
10846    0
10847    0
10848    0
10849    0
10850    0
10851    0
10852    0
10853    0
10854    0
10855    0
10856    0
10857    0
10858    0
10859    0
10860    0
10861    0
10862    0
10863    0
10864    0
10865    0
10866    0
10867    0
10868    0
10869    0
10870    0
10871    0
10872    0
10873    0
10874    0
10875    0
10876    0
10877    0
10878    0
10879    0
10880    0
10881    0
10882    0
10883    0
10884    0
10885    0
10886    0
10887    0
10888    0
10889    0
10890    0
10891    0
10892    0
10893    0
10894    0
10895    0
10896    0
10897    0
10898    0
10899    0
10900    0
10901    0
10902    0
10903    0
10904    0
10905    0
10906    0
10907    0
10908    0
10909    0
10910    0
10911    0
10912    0
10913    0
10914    0
10915    0
10916    0
10917    0
10918    0
10919    0
10920    0
10921    0
10922    0
10923    0
10924    0
10925    0
10926    0
10927    0
10928    0
10929    0
10930    0
10931    0
10932    0
10933    0
10934    0
10935    0
10936    0
10937    0
10938    0
10939    0
10940    0
10941    0
10942    0
10943    0
10944    0
10945    0
10946    0
10947    0
10948    0
10949    0
10950    0
10951    0
10952    0
10953    0
10954    0
10955    0
10956    0
10957    0
10958    0
10959    0
10960    0
10961    0
10962    0
10963    0
10964    0
10965    0
10966    0
10967    0
10968    0
10969    0
10970    0
10971    0
10972    0
10973    0
10974    0
10975    0
10976    0
10977    0
10978    0
10979    0
10980    0
10981    0
10982    0
10983    0
10984    0
10985    0
10986    0
10987    0
10988    0
10989    0
10990    0
10991    0
10992    0
10993    0
10994    0
10995    0
10996    0
10997    0
10998    0
10999    0
11000    0
11001    0
11002    0
11003    0
11004    0
11005    0
11006    0
11007    0
11008    0
11009    0
11010    0
11011    0
11012    0
11013    0
11014    0
11015    0
11016    0
11017    0
11018    0
11019    0
11020    0
11021    0
11022    0
11023    0
11024    0
11025    0
11026    0
11027    0
11028    0
11029    0
11030    0
11031    0
11032    0
11033    0
11034    0
11035    0
11036    0
11037    0
11038    0
11039    0
11040    0
11041    0
11042    0
11043    0
11044    0
11045    0
11046    0
11047    0
11048    0
11049    0
11050    0
11051    0
11052    0
11053    0
11054    0
11055    0
11056    0
11057    0
11058    0
11059    0
Length: 11060, dtype: int64
********************************************************************************
Analyzing "Match" Table >> DataFrame "dfMatch" [Dim:(25979, 115)]:

 DataFrame shape: (25979, 115)
 DataFrame columns: ['id', 'country_id', 'league_id', 'season', 'stage', 'date', 'match_api_id', 'home_team_api_id', 'away_team_api_id', 'home_team_goal', 'away_team_goal', 'home_player_X1', 'home_player_X2', 'home_player_X3', 'home_player_X4', 'home_player_X5', 'home_player_X6', 'home_player_X7', 'home_player_X8', 'home_player_X9', 'home_player_X10', 'home_player_X11', 'away_player_X1', 'away_player_X2', 'away_player_X3', 'away_player_X4', 'away_player_X5', 'away_player_X6', 'away_player_X7', 'away_player_X8', 'away_player_X9', 'away_player_X10', 'away_player_X11', 'home_player_Y1', 'home_player_Y2', 'home_player_Y3', 'home_player_Y4', 'home_player_Y5', 'home_player_Y6', 'home_player_Y7', 'home_player_Y8', 'home_player_Y9', 'home_player_Y10', 'home_player_Y11', 'away_player_Y1', 'away_player_Y2', 'away_player_Y3', 'away_player_Y4', 'away_player_Y5', 'away_player_Y6', 'away_player_Y7', 'away_player_Y8', 'away_player_Y9', 'away_player_Y10', 'away_player_Y11', 'home_player_1', 'home_player_2', 'home_player_3', 'home_player_4', 'home_player_5', 'home_player_6', 'home_player_7', 'home_player_8', 'home_player_9', 'home_player_10', 'home_player_11', 'away_player_1', 'away_player_2', 'away_player_3', 'away_player_4', 'away_player_5', 'away_player_6', 'away_player_7', 'away_player_8', 'away_player_9', 'away_player_10', 'away_player_11', 'goal', 'shoton', 'shotoff', 'foulcommit', 'card', 'cross', 'corner', 'possession', 'B365H', 'B365D', 'B365A', 'BWH', 'BWD', 'BWA', 'IWH', 'IWD', 'IWA', 'LBH', 'LBD', 'LBA', 'PSH', 'PSD', 'PSA', 'WHH', 'WHD', 'WHA', 'SJH', 'SJD', 'SJA', 'VCH', 'VCD', 'VCA', 'GBH', 'GBD', 'GBA', 'BSH', 'BSD', 'BSA']

info():
<class 'pandas.core.frame.DataFrame'>
RangeIndex: 25979 entries, 0 to 25978
Columns: 115 entries, id to BSA
dtypes: float64(96), int64(9), object(10)
memory usage: 22.8+ MB

describe():                  id    country_id     league_id         stage  match_api_id  \
count  25979.000000  25979.000000  25979.000000  25979.000000  2.597900e+04   
mean   12990.000000  11738.630317  11738.630317     18.242773  1.195429e+06   
std     7499.635658   7553.936759   7553.936759     10.407354  4.946279e+05   
min        1.000000      1.000000      1.000000      1.000000  4.831290e+05   
25%     6495.500000   4769.000000   4769.000000      9.000000  7.684365e+05   
50%    12990.000000  10257.000000  10257.000000     18.000000  1.147511e+06   
75%    19484.500000  17642.000000  17642.000000     27.000000  1.709852e+06   
max    25979.000000  24558.000000  24558.000000     38.000000  2.216672e+06   

       home_team_api_id  away_team_api_id  home_team_goal  away_team_goal  \
count      25979.000000      25979.000000    25979.000000    25979.000000   
mean        9984.371993       9984.475115        1.544594        1.160938   
std        14087.453758      14087.445135        1.297158        1.142110   
min         1601.000000       1601.000000        0.000000        0.000000   
25%         8475.000000       8475.000000        1.000000        0.000000   
50%         8697.000000       8697.000000        1.000000        1.000000   
75%         9925.000000       9925.000000        2.000000        2.000000   
max       274581.000000     274581.000000       10.000000        9.000000   

       home_player_X1  home_player_X2  home_player_X3  home_player_X4  \
count    24158.000000    24158.000000    24147.000000    24147.000000   
mean         0.999586        2.073516        4.061001        6.049199   
std          0.022284        0.387185        0.385973        0.446024   
min          0.000000        0.000000        1.000000        2.000000   
25%          1.000000        2.000000        4.000000        6.000000   
50%          1.000000        2.000000        4.000000        6.000000   
75%          1.000000        2.000000        4.000000        6.000000   
max          2.000000        8.000000        8.000000        8.000000   

       home_player_X5  home_player_X6  home_player_X7  home_player_X8  \
count    24147.000000    24147.000000    24147.000000    24147.000000   
mean         7.544871        3.185158        4.769909        5.309604   
std          1.609682        1.233609        1.092386        1.687376   
min          1.000000        1.000000        1.000000        1.000000   
25%          8.000000        2.000000        4.000000        3.000000   
50%          8.000000        3.000000        5.000000        6.000000   
75%          8.000000        4.000000        6.000000        7.000000   
max          9.000000        9.000000        9.000000        9.000000   

       home_player_X9  home_player_X10  home_player_X11  away_player_X1  \
count    24147.000000     24147.000000     24147.000000    24147.000000   
mean         5.822048         5.389407         5.783244        1.000290   
std          1.968550         1.492068         0.759585        0.033438   
min          1.000000         1.000000         1.000000        1.000000   
25%          5.000000         4.000000         5.000000        1.000000   
50%          5.000000         5.000000         6.000000        1.000000   
75%          8.000000         7.000000         6.000000        1.000000   
max          9.000000         9.000000         7.000000        6.000000   

       away_player_X2  away_player_X3  away_player_X4  away_player_X5  \
count    24147.000000    24147.000000    24147.000000     24147.00000   
mean         2.074585        4.058558        6.052222         7.52549   
std          0.403572        0.387047        0.448413         1.63765   
min          1.000000        2.000000        1.000000         1.00000   
25%          2.000000        4.000000        6.000000         8.00000   
50%          2.000000        4.000000        6.000000         8.00000   
75%          2.000000        4.000000        6.000000         8.00000   
max          8.000000        9.000000        8.000000         9.00000   

       away_player_X6  away_player_X7  away_player_X8  away_player_X9  \
count    24147.000000    24147.000000    24147.000000    24146.000000   
mean         3.195387        4.742660        5.294115        5.807546   
std          1.284060        1.118226        1.687324        1.957116   
min          1.000000        1.000000        1.000000        1.000000   
25%          2.000000        4.000000        3.000000        5.000000   
50%          3.000000        5.000000        6.000000        5.000000   
75%          4.000000        6.000000        7.000000        8.000000   
max          9.000000        9.000000        9.000000        9.000000   

       away_player_X10  away_player_X11  home_player_Y1  home_player_Y2  \
count     24146.000000     24140.000000    24158.000000    24158.000000   
mean          5.476187         5.766114        0.999627        2.998634   
std           1.541351         0.763611        0.024916        0.064003   
min           1.000000         3.000000        0.000000        0.000000   
25%           4.000000         5.000000        1.000000        3.000000   
50%           5.000000         6.000000        1.000000        3.000000   
75%           7.000000         6.000000        1.000000        3.000000   
max           9.000000         8.000000        3.000000        3.000000   

       home_player_Y3  home_player_Y4  home_player_Y5  home_player_Y6  \
count    24147.000000    24147.000000    24147.000000    24147.000000   
mean         3.000083        3.000414        3.237214        6.476954   
std          0.012871        0.028777        0.940481        0.739988   
min          3.000000        3.000000        3.000000        3.000000   
25%          3.000000        3.000000        3.000000        6.000000   
50%          3.000000        3.000000        3.000000        7.000000   
75%          3.000000        3.000000        3.000000        7.000000   
max          5.000000        5.000000        8.000000        9.000000   

       home_player_Y7  home_player_Y8  home_player_Y9  home_player_Y10  \
count    24147.000000    24147.000000    24147.000000     24147.000000   
mean         6.672257        7.238953        8.026339         9.218868   
std          0.588560        0.588947        1.126462         1.073893   
min          3.000000        3.000000        1.000000         3.000000   
25%          6.000000        7.000000        7.000000         8.000000   
50%          7.000000        7.000000        8.000000        10.000000   
75%          7.000000        8.000000        8.000000        10.000000   
max          9.000000       10.000000       10.000000        11.000000   

       home_player_Y11  away_player_Y1  away_player_Y2  away_player_Y3  \
count      24147.00000    24147.000000         24147.0    24147.000000   
mean          10.43699        1.000248             3.0        3.000166   
std            0.50578        0.022292             0.0        0.025741   
min            1.00000        1.000000             3.0        3.000000   
25%           10.00000        1.000000             3.0        3.000000   
50%           10.00000        1.000000             3.0        3.000000   
75%           11.00000        1.000000             3.0        3.000000   
max           11.00000        3.000000             3.0        7.000000   

       away_player_Y4  away_player_Y5  away_player_Y6  away_player_Y7  \
count    24147.000000    24147.000000    24147.000000    24147.000000   
mean         3.000248        3.245124        6.469706        6.680292   
std          0.028779        0.955398        0.756391        0.586502   
min          3.000000        3.000000        3.000000        3.000000   
25%          3.000000        3.000000        6.000000        6.000000   
50%          3.000000        3.000000        7.000000        7.000000   
75%          3.000000        3.000000        7.000000        7.000000   
max          7.000000        9.000000       10.000000       10.000000   

       away_player_Y8  away_player_Y9  away_player_Y10  away_player_Y11  \
count    24147.000000    24146.000000     24146.000000     24140.000000   
mean         7.246366        8.022115         9.161186        10.455178   
std          0.579881        1.126282         1.102167         0.498745   
min          3.000000        5.000000         6.000000         7.000000   
25%          7.000000        7.000000         8.000000        10.000000   
50%          7.000000        8.000000        10.000000        10.000000   
75%          8.000000        8.000000        10.000000        11.000000   
max         10.000000       11.000000        11.000000        11.000000   

       home_player_1  home_player_2  home_player_3  home_player_4  \
count   24755.000000   24664.000000   24698.000000   24656.000000   
mean    76638.362432  106854.091996   91601.289457   94540.204494   
std     87556.226324  112619.266880  102703.003558  102789.967205   
min      2984.000000    2802.000000    2752.000000    2752.000000   
25%     30602.000000   32574.000000   30602.000000   30627.000000   
50%     38230.000000   42388.000000   39731.000000   41060.000000   
75%     96836.000000  159854.000000  128036.750000  145561.000000   
max    698273.000000  748432.000000  705484.000000  723037.000000   

       home_player_5  home_player_6  home_player_7  home_player_8  \
count   24663.000000   24654.000000   24752.000000   24670.000000   
mean   109527.842071  102308.768516   97287.636918  107290.992947   
std    112787.542376  111514.726490  107050.579262  114527.022010   
min      2752.000000    2625.000000    2625.000000    2625.000000   
25%     33579.000000   31037.000000   30895.000000   32751.000000   
50%     45996.000000   41467.000000   41432.000000   43319.000000   
75%    160243.000000  150944.000000  141699.000000  160243.000000   
max    733787.000000  750584.000000  692984.000000  693171.000000   

       home_player_9  home_player_10  home_player_11  away_player_1  \
count   24706.000000    24543.000000    24424.000000   24745.000000   
mean   111131.536914   105612.508373   103413.870660   76628.199879   
std    116221.869480   111528.262892   108761.188832   87428.953151   
min      2625.000000     2625.000000     2802.000000    2796.000000   
25%     33332.250000    32465.000000    32627.000000   30622.000000   
50%     45605.000000    43296.000000    42091.000000   38289.000000   
75%    164479.000000   158783.000000   161291.000000   96836.000000   
max    730065.000000   742405.000000   726956.000000  698273.000000   

       away_player_2  away_player_3  away_player_4  away_player_5  \
count   24701.000000   24686.000000   24658.000000   24644.000000   
mean   107614.617424   91126.781293   95083.910617  109800.936901   
std    113758.447918  101438.583414  103845.905029  112450.715516   
min      2790.000000    2752.000000    2752.000000    2790.000000   
25%     32579.000000   30464.000000   30627.000000   33453.500000   
50%     42388.000000   39892.000000   41083.000000   46212.000000   
75%    159882.000000  121080.000000  145561.000000  160844.000000   
max    748432.000000  705484.000000  728414.000000  746419.000000   

       away_player_6  away_player_7  away_player_8  away_player_9  \
count   24666.000000   24744.000000   24638.000000   24651.000000   
mean   102308.262588   97898.056014  109265.114214  111087.003448   
std    110861.276758  107359.030969  116358.092456  116492.957656   
min      2625.000000    2625.000000    2625.000000    2625.000000   
25%     31037.000000   30920.000000   32863.000000   33435.000000   
50%     41634.500000   41433.000000   45816.000000   45860.000000   
75%    151079.000000  144996.000000  163611.750000  164209.000000   
max    722766.000000  750435.000000  717248.000000  722766.000000   

       away_player_10  away_player_11         B365H         B365D  \
count    24538.000000    24425.000000  22592.000000  22592.000000   
mean    107149.115943   104932.876970      2.628818      3.839684   
std     112397.540326   110237.525957      1.794463      1.118269   
min       2770.000000     2802.000000      1.040000      1.400000   
25%      32627.000000    32747.000000      1.670000      3.300000   
50%      45358.000000    42652.000000      2.100000      3.500000   
75%     161291.000000   161660.000000      2.800000      4.000000   
max     722766.000000   726956.000000     26.000000     17.000000   

              B365A           BWH           BWD           BWA           IWH  \
count  22592.000000  22575.000000  22575.000000  22575.000000  22520.000000   
mean       4.662222      2.559245      3.747597      4.396949      2.467613   
std        3.730104      1.637983      1.033468      3.292265      1.445642   
min        1.080000      1.030000      1.650000      1.100000      1.030000   
25%        2.500000      1.650000      3.200000      2.500000      1.650000   
50%        3.500000      2.100000      3.400000      3.400000      2.100000   
75%        5.250000      2.750000      3.800000      5.000000      2.600000   
max       51.000000     34.000000     19.500000     51.000000     20.000000   

                IWD           IWA           LBH           LBD           LBA  \
count  22520.000000  22520.000000  22556.000000  22556.000000  22556.000000   
mean       3.608932      4.150575      2.536202      3.711740      4.385351   
std        0.803582      2.858912      1.640648      1.003846      3.398285   
min        1.500000      1.100000      1.040000      1.400000      1.100000   
25%        3.200000      2.500000      1.670000      3.200000      2.500000   
50%        3.300000      3.300000      2.100000      3.400000      3.300000   
75%        3.700000      4.600000      2.700000      3.750000      5.000000   
max       11.000000     25.000000     26.000000     19.000000     51.000000   

                PSH           PSD           PSA           WHH           WHD  \
count  11168.000000  11168.000000  11168.000000  22571.000000  22571.000000   
mean       2.816447      4.132324      4.972744      2.578737      3.665295   
std        2.187924      1.516680      4.475458      1.685725      0.958069   
min        1.040000      2.200000      1.090000      1.020000      1.020000   
25%        1.720000      3.410000      2.560000      1.670000      3.200000   
50%        2.200000      3.640000      3.610000      2.150000      3.300000   
75%        2.980000      4.230000      5.410000      2.750000      3.750000   
max       36.000000     29.000000     47.500000     26.000000     17.000000   

                WHA           SJH           SJD           SJA           VCH  \
count  22571.000000  17097.000000  17097.000000  17097.000000  22568.000000   
mean       4.482585      2.566061      3.755879      4.622343      2.668107   
std        3.585083      1.655535      0.999582      3.632164      1.928753   
min        1.080000      1.040000      1.400000      1.100000      1.030000   
25%        2.500000      1.670000      3.250000      2.500000      1.700000   
50%        3.400000      2.100000      3.400000      3.500000      2.150000   
75%        5.000000      2.750000      3.800000      5.250000      2.800000   
max       51.000000     23.000000     15.000000     41.000000     36.000000   

                VCD           VCA           GBH           GBD           GBA  \
count  22568.000000  22568.000000  14162.000000  14162.000000  14162.000000   
mean       3.899048      4.840281      2.498764      3.648189      4.353097   
std        1.248221      4.318338      1.489299      0.867440      3.010189   
min        1.620000      1.080000      1.050000      1.450000      1.120000   
25%        3.300000      2.550000      1.670000      3.200000      2.500000   
50%        3.500000      3.500000      2.100000      3.300000      3.400000   
75%        4.000000      5.400000      2.650000      3.750000      5.000000   
max       26.000000     67.000000     21.000000     11.000000     34.000000   

                BSH           BSD           BSA  
count  14161.000000  14161.000000  14161.000000  
mean       2.497894      3.660742      4.405663  
std        1.507793      0.868272      3.189814  
min        1.040000      1.330000      1.120000  
25%        1.670000      3.250000      2.500000  
50%        2.100000      3.400000      3.400000  
75%        2.620000      3.750000      5.000000  
max       17.000000     13.000000     34.000000  

Describe Categorical Attributes:            season                 date      goal      shoton      shotoff  \
count       25979                25979     14217       14217        14217   
unique          8                 1694     13225        8464         8464   
top     2008/2009  2009-04-11 00:00:00  <goal />  <shoton />  <shotoff />   
freq         3326                   60       993        5754         5754   

            foulcommit      card      cross      corner      possession  
count            14217     14217      14217       14217           14217  
unique            8466     13777       8466        8465            8420  
top     <foulcommit />  <card />  <cross />  <corner />  <possession />  
freq              5752       441       5752        5753            5798  

Match Attributes with Missing Values: id                      0
country_id              0
league_id               0
season                  0
stage                   0
date                    0
match_api_id            0
home_team_api_id        0
away_team_api_id        0
home_team_goal          0
away_team_goal          0
home_player_X1       1821
home_player_X2       1821
home_player_X3       1832
home_player_X4       1832
home_player_X5       1832
home_player_X6       1832
home_player_X7       1832
home_player_X8       1832
home_player_X9       1832
home_player_X10      1832
home_player_X11      1832
away_player_X1       1832
away_player_X2       1832
away_player_X3       1832
away_player_X4       1832
away_player_X5       1832
away_player_X6       1832
away_player_X7       1832
away_player_X8       1832
away_player_X9       1833
away_player_X10      1833
away_player_X11      1839
home_player_Y1       1821
home_player_Y2       1821
home_player_Y3       1832
home_player_Y4       1832
home_player_Y5       1832
home_player_Y6       1832
home_player_Y7       1832
home_player_Y8       1832
home_player_Y9       1832
home_player_Y10      1832
home_player_Y11      1832
away_player_Y1       1832
away_player_Y2       1832
away_player_Y3       1832
away_player_Y4       1832
away_player_Y5       1832
away_player_Y6       1832
away_player_Y7       1832
away_player_Y8       1832
away_player_Y9       1833
away_player_Y10      1833
away_player_Y11      1839
home_player_1        1224
home_player_2        1315
home_player_3        1281
home_player_4        1323
home_player_5        1316
home_player_6        1325
home_player_7        1227
home_player_8        1309
home_player_9        1273
home_player_10       1436
home_player_11       1555
away_player_1        1234
away_player_2        1278
away_player_3        1293
away_player_4        1321
away_player_5        1335
away_player_6        1313
away_player_7        1235
away_player_8        1341
away_player_9        1328
away_player_10       1441
away_player_11       1554
goal                11762
shoton              11762
shotoff             11762
foulcommit          11762
card                11762
cross               11762
corner              11762
possession          11762
B365H                3387
B365D                3387
B365A                3387
BWH                  3404
BWD                  3404
BWA                  3404
IWH                  3459
IWD                  3459
IWA                  3459
LBH                  3423
LBD                  3423
LBA                  3423
PSH                 14811
PSD                 14811
PSA                 14811
WHH                  3408
WHD                  3408
WHA                  3408
SJH                  8882
SJD                  8882
SJA                  8882
VCH                  3411
VCD                  3411
VCA                  3411
GBH                 11817
GBD                 11817
GBA                 11817
BSH                 11818
BSD                 11818
BSA                 11818
dtype: int64

Match Members with Missing Values: 0        77
1        77
2        77
3        77
4        77
5        80
6        77
7        77
8        77
9        77
10       77
11       77
12       77
13       77
14       77
15       77
16       77
17       77
18       77
19       77
20       77
21       77
22       77
23       77
24       77
25       77
26       77
27       77
28       77
29       77
30       77
31       77
32       77
33       77
34       77
35       77
36       77
37       77
38       77
39       77
40       77
41       77
42       77
43       77
44       77
45       77
46       77
47       77
48       77
49       77
50       77
51       77
52       77
53       77
54       77
55       77
56       77
57       77
58       77
59       77
60       77
61       77
62       77
63       77
64       77
65       77
66       77
67       77
68       77
69       77
70       77
71       77
72       77
73       77
74       77
75       77
76       77
77       77
78       77
79       77
80       77
81       77
82       77
83       77
84       77
85       77
86       77
87       77
88       77
89       77
90       77
91       77
92       77
93       77
94       77
95       77
96       77
97       77
98       77
99       77
100      77
101      77
102      77
103      77
104      77
105      77
106      77
107      77
108      77
109      77
110      77
111      77
112      77
113      77
114      77
115      77
116      77
117      77
118      77
119      77
120      77
121      77
122      77
123      77
124      77
125      77
126      77
127      77
128      77
129      77
130      77
131      77
132      77
133      77
134      77
135      77
136      77
137      77
138      77
139      77
140      77
141      77
142      77
143      77
144      16
145      11
146      17
147      12
148      13
149      12
150      13
151      14
152      12
153      11
154      14
155      11
156      13
157      12
158      12
159      22
160      13
161      14
162      11
163      12
164      41
165      15
166      12
167      12
168      11
169      16
170      16
171      12
172      14
173      11
174      15
175      14
176      11
177      17
178      12
179      12
180      12
181      20
182      13
183      12
184      13
185      12
186      14
187      13
188      12
189      11
190      11
191      12
192      12
193      13
194      14
195      17
196      15
197      77
198      77
199      77
200      77
201      77
202      77
203      77
204      77
205      77
206      77
207      12
208      15
209      13
210      18
211      13
212      12
213      14
214      13
215      13
216      12
217      38
218      12
219      11
220      15
221      12
222      15
223      13
224      13
225      13
226      11
227      15
228      12
229      11
230      12
231      11
232      12
233      16
234      11
235      12
236      15
237      12
238      13
239      13
240      11
241      14
242      11
243      11
244      11
245      13
246      14
247      12
248      15
249      15
         ..
25729    38
25730    38
25731    38
25732    38
25733    38
25734    38
25735    38
25736    38
25737    39
25738    39
25739    38
25740    38
25741    38
25742    38
25743    38
25744    38
25745    38
25746    39
25747    38
25748    38
25749    38
25750    39
25751    38
25752    38
25753    39
25754    38
25755    38
25756    38
25757    38
25758    38
25759    38
25760    38
25761    38
25762    38
25763    38
25764    38
25765    38
25766    39
25767    38
25768    38
25769    38
25770    38
25771    38
25772    38
25773    38
25774    38
25775    38
25776    38
25777    38
25778    39
25779    38
25780    38
25781    38
25782    39
25783    38
25784    38
25785    38
25786    39
25787    38
25788    38
25789    38
25790    38
25791    38
25792    38
25793    38
25794    38
25795    38
25796    38
25797    38
25798    38
25799    38
25800    38
25801    38
25802    40
25803    38
25804    40
25805    38
25806    38
25807    39
25808    38
25809    39
25810    38
25811    38
25812    38
25813    38
25814    38
25815    39
25816    38
25817    38
25818    38
25819    38
25820    38
25821    38
25822    40
25823    38
25824    38
25825    38
25826    38
25827    38
25828    38
25829    38
25830    38
25831    38
25832    38
25833    38
25834    38
25835    38
25836    38
25837    38
25838    39
25839    30
25840    30
25841    30
25842    30
25843    30
25844    32
25845    30
25846    30
25847    30
25848    30
25849    30
25850    30
25851    30
25852    30
25853    30
25854    38
25855    38
25856    40
25857    38
25858    38
25859    30
25860    30
25861    30
25862    30
25863    30
25864    30
25865    30
25866    30
25867    30
25868    30
25869    30
25870    30
25871    30
25872    30
25873    30
25874    30
25875    30
25876    30
25877    30
25878    30
25879    30
25880    30
25881    30
25882    30
25883    30
25884    30
25885    30
25886    30
25887    30
25888    30
25889    30
25890    30
25891    30
25892    30
25893    30
25894    30
25895    30
25896    30
25897    30
25898    30
25899    30
25900    30
25901    30
25902    30
25903    30
25904    30
25905    30
25906    30
25907    30
25908    30
25909    38
25910    38
25911    40
25912    38
25913    38
25914    30
25915    30
25916    30
25917    30
25918    30
25919    30
25920    30
25921    30
25922    30
25923    30
25924    30
25925    30
25926    30
25927    30
25928    30
25929    30
25930    30
25931    31
25932    30
25933    30
25934    30
25935    30
25936    30
25937    30
25938    30
25939    30
25940    30
25941    30
25942    30
25943    30
25944    30
25945    30
25946    30
25947    30
25948    30
25949    40
25950    38
25951    38
25952    39
25953    38
25954    38
25955    39
25956    38
25957    38
25958    38
25959    40
25960    38
25961    38
25962    38
25963    38
25964    39
25965    38
25966    38
25967    38
25968    38
25969    38
25970    38
25971    38
25972    38
25973    39
25974    38
25975    38
25976    38
25977    39
25978    38
Length: 25979, dtype: int64
********************************************************************************
Analyzing "League" Table >> DataFrame "dfLeague" [Dim:(11, 3)]:

 DataFrame shape: (11, 3)
 DataFrame columns: ['id', 'country_id', 'name']

info():
<class 'pandas.core.frame.DataFrame'>
RangeIndex: 11 entries, 0 to 10
Data columns (total 3 columns):
id            11 non-null int64
country_id    11 non-null int64
name          11 non-null object
dtypes: int64(2), object(1)
memory usage: 344.0+ bytes

describe():                  id    country_id
count     11.000000     11.000000
mean   12452.090909  12452.090909
std     8215.308472   8215.308472
min        1.000000      1.000000
25%     6289.000000   6289.000000
50%    13274.000000  13274.000000
75%    18668.000000  18668.000000
max    24558.000000  24558.000000

Describe Categorical Attributes:                   name
count               11
unique              11
top     France Ligue 1
freq                 1

League Attributes with Missing Values: id            0
country_id    0
name          0
dtype: int64

League Members with Missing Values: 0     0
1     0
2     0
3     0
4     0
5     0
6     0
7     0
8     0
9     0
10    0
dtype: int64
********************************************************************************
Analyzing "Country" Table >> DataFrame "dfCountry" [Dim:(11, 2)]:

 DataFrame shape: (11, 2)
 DataFrame columns: ['id', 'name']

info():
<class 'pandas.core.frame.DataFrame'>
RangeIndex: 11 entries, 0 to 10
Data columns (total 2 columns):
id      11 non-null int64
name    11 non-null object
dtypes: int64(1), object(1)
memory usage: 256.0+ bytes

describe():                  id
count     11.000000
mean   12452.090909
std     8215.308472
min        1.000000
25%     6289.000000
50%    13274.000000
75%    18668.000000
max    24558.000000

Describe Categorical Attributes:             name
count         11
unique        11
top     Scotland
freq           1

Country Attributes with Missing Values: id      0
name    0
dtype: int64

Country Members with Missing Values: 0     0
1     0
2     0
3     0
4     0
5     0
6     0
7     0
8     0
9     0
10    0
dtype: int64
********************************************************************************
Analyzing "Team" Table >> DataFrame "dfTeam" [Dim:(299, 5)]:

 DataFrame shape: (299, 5)
 DataFrame columns: ['id', 'team_api_id', 'team_fifa_api_id', 'team_long_name', 'team_short_name']

info():
<class 'pandas.core.frame.DataFrame'>
RangeIndex: 299 entries, 0 to 298
Data columns (total 5 columns):
id                  299 non-null int64
team_api_id         299 non-null int64
team_fifa_api_id    288 non-null float64
team_long_name      299 non-null object
team_short_name     299 non-null object
dtypes: float64(1), int64(2), object(2)
memory usage: 11.8+ KB

describe():                  id    team_api_id  team_fifa_api_id
count    299.000000     299.000000        288.000000
mean   23735.301003   12340.521739      21534.305556
std    15167.914719   25940.411135      42456.439408
min        1.000000    1601.000000          1.000000
25%     9552.500000    8349.000000        178.750000
50%    22805.000000    8655.000000        673.500000
75%    36250.500000    9886.500000       1910.750000
max    51606.000000  274581.000000     112513.000000

Describe Categorical Attributes:        team_long_name team_short_name
count             299             299
unique            296             259
top     Polonia Bytom             GEN
freq                2               3

Team Attributes with Missing Values: id                   0
team_api_id          0
team_fifa_api_id    11
team_long_name       0
team_short_name      0
dtype: int64

Team Members with Missing Values: 0      0
1      0
2      0
3      0
4      0
5      0
6      0
7      0
8      1
9      0
10     0
11     0
12     0
13     0
14     1
15     0
16     0
17     0
18     0
19     0
20     0
21     0
22     0
23     0
24     0
25     0
26     0
27     0
28     0
29     0
30     0
31     0
32     0
33     0
34     0
35     0
36     0
37     0
38     0
39     0
40     0
41     0
42     0
43     0
44     0
45     0
46     0
47     0
48     0
49     0
50     0
51     0
52     0
53     0
54     0
55     0
56     0
57     0
58     0
59     0
60     0
61     0
62     0
63     0
64     0
65     0
66     0
67     0
68     0
69     0
70     0
71     0
72     0
73     0
74     0
75     0
76     0
77     0
78     0
79     0
80     0
81     0
82     0
83     0
84     0
85     0
86     0
87     0
88     0
89     0
90     0
91     0
92     0
93     0
94     0
95     0
96     0
97     0
98     0
99     0
100    0
101    0
102    0
103    0
104    0
105    0
106    0
107    0
108    0
109    0
110    0
111    0
112    0
113    0
114    0
115    0
116    0
117    0
118    0
119    0
120    0
121    0
122    0
123    0
124    0
125    0
126    0
127    0
128    0
129    0
130    0
131    0
132    0
133    0
134    0
135    0
136    0
137    0
138    0
139    0
140    0
141    0
142    0
143    0
144    0
145    0
146    0
147    0
148    0
149    0
150    0
151    0
152    0
153    0
154    0
155    0
156    0
157    0
158    0
159    0
160    0
161    0
162    0
163    0
164    0
165    0
166    0
167    0
168    0
169    0
170    1
171    0
172    0
173    0
174    0
175    0
176    0
177    0
178    0
179    0
180    0
181    0
182    0
183    0
184    0
185    0
186    0
187    0
188    0
189    0
190    0
191    0
192    0
193    0
194    0
195    0
196    0
197    0
198    0
199    0
200    0
201    0
202    0
203    0
204    1
205    0
206    0
207    0
208    1
209    0
210    0
211    0
212    0
213    1
214    0
215    0
216    0
217    0
218    0
219    0
220    0
221    0
222    0
223    1
224    0
225    1
226    0
227    0
228    0
229    0
230    0
231    0
232    1
233    1
234    0
235    0
236    0
237    0
238    0
239    0
240    0
241    0
242    0
243    0
244    0
245    0
246    0
247    0
248    0
249    0
250    0
251    0
252    0
253    0
254    0
255    0
256    0
257    0
258    0
259    0
260    0
261    0
262    0
263    0
264    0
265    0
266    0
267    0
268    0
269    0
270    0
271    0
272    0
273    0
274    0
275    0
276    0
277    0
278    0
279    0
280    0
281    0
282    0
283    0
284    0
285    0
286    0
287    0
288    0
289    0
290    0
291    0
292    0
293    0
294    0
295    0
296    0
297    0
298    1
dtype: int64
********************************************************************************
In [7]:
## Use head() - show the first five records in all dataframes 

for tbl in tblNames:
  '''A quick peek into each dataframe with header()'''
  dfTblName = "df" + tbl
  print('Analyzing "{}" Table >> DataFrame "{}" [Dim:{}]:\n'.format(tbl, dfTblName, DFs[dfTblName].shape))
  print(DFs[dfTblName].head())
  print(80 * "*" + "\n")
Analyzing "sqlite_sequence" Table >> DataFrame "dfsqlite_sequence" [Dim:(7, 2)]:

      name     seq
0     Team  103916
1  Country   51958
2   League   51958
3    Match   51958
4   Player   11075
********************************************************************************

Analyzing "Player_Attributes" Table >> DataFrame "dfPlayer_Attributes" [Dim:(183978, 42)]:

   id  player_fifa_api_id  player_api_id                 date  overall_rating  \
0   1              218353         505942  2016-02-18 00:00:00            67.0   
1   2              218353         505942  2015-11-19 00:00:00            67.0   
2   3              218353         505942  2015-09-21 00:00:00            62.0   
3   4              218353         505942  2015-03-20 00:00:00            61.0   
4   5              218353         505942  2007-02-22 00:00:00            61.0   

   potential preferred_foot attacking_work_rate defensive_work_rate  crossing  \
0       71.0          right              medium              medium      49.0   
1       71.0          right              medium              medium      49.0   
2       66.0          right              medium              medium      49.0   
3       65.0          right              medium              medium      48.0   
4       65.0          right              medium              medium      48.0   

   finishing  heading_accuracy  short_passing  volleys  dribbling  curve  \
0       44.0              71.0           61.0     44.0       51.0   45.0   
1       44.0              71.0           61.0     44.0       51.0   45.0   
2       44.0              71.0           61.0     44.0       51.0   45.0   
3       43.0              70.0           60.0     43.0       50.0   44.0   
4       43.0              70.0           60.0     43.0       50.0   44.0   

   free_kick_accuracy  long_passing  ball_control  acceleration  sprint_speed  \
0                39.0          64.0          49.0          60.0          64.0   
1                39.0          64.0          49.0          60.0          64.0   
2                39.0          64.0          49.0          60.0          64.0   
3                38.0          63.0          48.0          60.0          64.0   
4                38.0          63.0          48.0          60.0          64.0   

   agility  reactions  balance  shot_power  jumping  stamina  strength  \
0     59.0       47.0     65.0        55.0     58.0     54.0      76.0   
1     59.0       47.0     65.0        55.0     58.0     54.0      76.0   
2     59.0       47.0     65.0        55.0     58.0     54.0      76.0   
3     59.0       46.0     65.0        54.0     58.0     54.0      76.0   
4     59.0       46.0     65.0        54.0     58.0     54.0      76.0   

   long_shots  aggression  interceptions  positioning  vision  penalties  \
0        35.0        71.0           70.0         45.0    54.0       48.0   
1        35.0        71.0           70.0         45.0    54.0       48.0   
2        35.0        63.0           41.0         45.0    54.0       48.0   
3        34.0        62.0           40.0         44.0    53.0       47.0   
4        34.0        62.0           40.0         44.0    53.0       47.0   

   marking  standing_tackle  sliding_tackle  gk_diving  gk_handling  \
0     65.0             69.0            69.0        6.0         11.0   
1     65.0             69.0            69.0        6.0         11.0   
2     65.0             66.0            69.0        6.0         11.0   
3     62.0             63.0            66.0        5.0         10.0   
4     62.0             63.0            66.0        5.0         10.0   

   gk_kicking  gk_positioning  gk_reflexes  
0        10.0             8.0          8.0  
1        10.0             8.0          8.0  
2        10.0             8.0          8.0  
3         9.0             7.0          7.0  
4         9.0             7.0          7.0  
********************************************************************************

Analyzing "Player" Table >> DataFrame "dfPlayer" [Dim:(11060, 7)]:

   id  player_api_id         player_name  player_fifa_api_id  \
0   1         505942  Aaron Appindangoye              218353   
1   2         155782     Aaron Cresswell              189615   
2   3         162549         Aaron Doran              186170   
3   4          30572       Aaron Galindo              140161   
4   5          23780        Aaron Hughes               17725   

              birthday  height  weight  
0  1992-02-29 00:00:00  182.88     187  
1  1989-12-15 00:00:00  170.18     146  
2  1991-05-13 00:00:00  170.18     163  
3  1982-05-08 00:00:00  182.88     198  
4  1979-11-08 00:00:00  182.88     154  
********************************************************************************

Analyzing "Match" Table >> DataFrame "dfMatch" [Dim:(25979, 115)]:

   id  country_id  league_id     season  stage                 date  \
0   1           1          1  2008/2009      1  2008-08-17 00:00:00   
1   2           1          1  2008/2009      1  2008-08-16 00:00:00   
2   3           1          1  2008/2009      1  2008-08-16 00:00:00   
3   4           1          1  2008/2009      1  2008-08-17 00:00:00   
4   5           1          1  2008/2009      1  2008-08-16 00:00:00   

   match_api_id  home_team_api_id  away_team_api_id  home_team_goal  \
0        492473              9987              9993               1   
1        492474             10000              9994               0   
2        492475              9984              8635               0   
3        492476              9991              9998               5   
4        492477              7947              9985               1   

   away_team_goal  home_player_X1  home_player_X2  home_player_X3  \
0               1             NaN             NaN             NaN   
1               0             NaN             NaN             NaN   
2               3             NaN             NaN             NaN   
3               0             NaN             NaN             NaN   
4               3             NaN             NaN             NaN   

   home_player_X4  home_player_X5  home_player_X6  home_player_X7  \
0             NaN             NaN             NaN             NaN   
1             NaN             NaN             NaN             NaN   
2             NaN             NaN             NaN             NaN   
3             NaN             NaN             NaN             NaN   
4             NaN             NaN             NaN             NaN   

   home_player_X8  home_player_X9  home_player_X10  home_player_X11  \
0             NaN             NaN              NaN              NaN   
1             NaN             NaN              NaN              NaN   
2             NaN             NaN              NaN              NaN   
3             NaN             NaN              NaN              NaN   
4             NaN             NaN              NaN              NaN   

   away_player_X1  away_player_X2  away_player_X3  away_player_X4  \
0             NaN             NaN             NaN             NaN   
1             NaN             NaN             NaN             NaN   
2             NaN             NaN             NaN             NaN   
3             NaN             NaN             NaN             NaN   
4             NaN             NaN             NaN             NaN   

   away_player_X5  away_player_X6  away_player_X7  away_player_X8  \
0             NaN             NaN             NaN             NaN   
1             NaN             NaN             NaN             NaN   
2             NaN             NaN             NaN             NaN   
3             NaN             NaN             NaN             NaN   
4             NaN             NaN             NaN             NaN   

   away_player_X9  away_player_X10  away_player_X11  home_player_Y1  \
0             NaN              NaN              NaN             NaN   
1             NaN              NaN              NaN             NaN   
2             NaN              NaN              NaN             NaN   
3             NaN              NaN              NaN             NaN   
4             NaN              NaN              NaN             NaN   

   home_player_Y2  home_player_Y3  home_player_Y4  home_player_Y5  \
0             NaN             NaN             NaN             NaN   
1             NaN             NaN             NaN             NaN   
2             NaN             NaN             NaN             NaN   
3             NaN             NaN             NaN             NaN   
4             NaN             NaN             NaN             NaN   

   home_player_Y6  home_player_Y7  home_player_Y8  home_player_Y9  \
0             NaN             NaN             NaN             NaN   
1             NaN             NaN             NaN             NaN   
2             NaN             NaN             NaN             NaN   
3             NaN             NaN             NaN             NaN   
4             NaN             NaN             NaN             NaN   

   home_player_Y10  home_player_Y11  away_player_Y1  away_player_Y2  \
0              NaN              NaN             NaN             NaN   
1              NaN              NaN             NaN             NaN   
2              NaN              NaN             NaN             NaN   
3              NaN              NaN             NaN             NaN   
4              NaN              NaN             NaN             NaN   

   away_player_Y3  away_player_Y4  away_player_Y5  away_player_Y6  \
0             NaN             NaN             NaN             NaN   
1             NaN             NaN             NaN             NaN   
2             NaN             NaN             NaN             NaN   
3             NaN             NaN             NaN             NaN   
4             NaN             NaN             NaN             NaN   

   away_player_Y7  away_player_Y8  away_player_Y9  away_player_Y10  \
0             NaN             NaN             NaN              NaN   
1             NaN             NaN             NaN              NaN   
2             NaN             NaN             NaN              NaN   
3             NaN             NaN             NaN              NaN   
4             NaN             NaN             NaN              NaN   

   away_player_Y11  home_player_1  home_player_2  home_player_3  \
0              NaN            NaN            NaN            NaN   
1              NaN            NaN            NaN            NaN   
2              NaN            NaN            NaN            NaN   
3              NaN            NaN            NaN            NaN   
4              NaN            NaN            NaN            NaN   

   home_player_4  home_player_5  home_player_6  home_player_7  home_player_8  \
0            NaN            NaN            NaN            NaN            NaN   
1            NaN            NaN            NaN            NaN            NaN   
2            NaN            NaN            NaN            NaN            NaN   
3            NaN            NaN            NaN            NaN            NaN   
4            NaN            NaN            NaN            NaN            NaN   

   home_player_9  home_player_10  home_player_11  away_player_1  \
0            NaN             NaN             NaN            NaN   
1            NaN             NaN             NaN            NaN   
2            NaN             NaN             NaN            NaN   
3            NaN             NaN             NaN            NaN   
4            NaN             NaN             NaN            NaN   

   away_player_2  away_player_3  away_player_4  away_player_5  away_player_6  \
0            NaN            NaN            NaN            NaN            NaN   
1            NaN            NaN            NaN            NaN            NaN   
2            NaN            NaN            NaN            NaN            NaN   
3            NaN            NaN            NaN            NaN            NaN   
4            NaN            NaN            NaN            NaN            NaN   

   away_player_7  away_player_8  away_player_9  away_player_10  \
0            NaN            NaN            NaN             NaN   
1            NaN            NaN            NaN             NaN   
2            NaN            NaN            NaN             NaN   
3            NaN            NaN            NaN             NaN   
4            NaN            NaN            NaN             NaN   

   away_player_11  goal shoton shotoff foulcommit  card cross corner  \
0             NaN  None   None    None       None  None  None   None   
1             NaN  None   None    None       None  None  None   None   
2             NaN  None   None    None       None  None  None   None   
3             NaN  None   None    None       None  None  None   None   
4             NaN  None   None    None       None  None  None   None   

  possession  B365H  B365D  B365A   BWH   BWD   BWA   IWH  IWD  IWA   LBH  \
0       None   1.73   3.40   5.00  1.75  3.35  4.20  1.85  3.2  3.5  1.80   
1       None   1.95   3.20   3.60  1.80  3.30  3.95  1.90  3.2  3.5  1.90   
2       None   2.38   3.30   2.75  2.40  3.30  2.55  2.60  3.1  2.3  2.50   
3       None   1.44   3.75   7.50  1.40  4.00  6.80  1.40  3.9  6.0  1.44   
4       None   5.00   3.50   1.65  5.00  3.50  1.60  4.00  3.3  1.7  4.00   

   LBD   LBA  PSH  PSD  PSA   WHH   WHD   WHA   SJH  SJD   SJA   VCH   VCD  \
0  3.3  3.75  NaN  NaN  NaN  1.70  3.30  4.33  1.90  3.3  4.00  1.65  3.40   
1  3.2  3.50  NaN  NaN  NaN  1.83  3.30  3.60  1.95  3.3  3.80  2.00  3.25   
2  3.2  2.50  NaN  NaN  NaN  2.50  3.25  2.40  2.63  3.3  2.50  2.35  3.25   
3  3.6  6.50  NaN  NaN  NaN  1.44  3.75  6.00  1.44  4.0  7.50  1.45  3.75   
4  3.4  1.72  NaN  NaN  NaN  4.20  3.40  1.70  4.50  3.5  1.73  4.50  3.40   

    VCA   GBH   GBD   GBA   BSH   BSD   BSA  
0  4.50  1.78  3.25  4.00  1.73  3.40  4.20  
1  3.25  1.85  3.25  3.75  1.91  3.25  3.60  
2  2.65  2.50  3.20  2.50  2.30  3.20  2.75  
3  6.50  1.50  3.75  5.50  1.44  3.75  6.50  
4  1.65  4.50  3.50  1.65  4.75  3.30  1.67  
********************************************************************************

Analyzing "League" Table >> DataFrame "dfLeague" [Dim:(11, 3)]:

      id  country_id                    name
0      1           1  Belgium Jupiler League
1   1729        1729  England Premier League
2   4769        4769          France Ligue 1
3   7809        7809   Germany 1. Bundesliga
4  10257       10257           Italy Serie A
********************************************************************************

Analyzing "Country" Table >> DataFrame "dfCountry" [Dim:(11, 2)]:

      id     name
0      1  Belgium
1   1729  England
2   4769   France
3   7809  Germany
4  10257    Italy
********************************************************************************

Analyzing "Team" Table >> DataFrame "dfTeam" [Dim:(299, 5)]:

   id  team_api_id  team_fifa_api_id     team_long_name team_short_name
0   1         9987             673.0           KRC Genk             GEN
1   2         9993             675.0       Beerschot AC             BAC
2   3        10000           15005.0   SV Zulte-Waregem             ZUL
3   4         9994            2007.0   Sporting Lokeren             LOK
4   5         9984            1750.0  KSV Cercle Brugge             CEB
********************************************************************************

Analyzing "Team_Attributes" Table >> DataFrame "dfTeam_Attributes" [Dim:(1458, 25)]:

   id  team_fifa_api_id  team_api_id                 date  buildUpPlaySpeed  \
0   1               434         9930  2010-02-22 00:00:00                60   
1   2               434         9930  2014-09-19 00:00:00                52   
2   3               434         9930  2015-09-10 00:00:00                47   
3   4                77         8485  2010-02-22 00:00:00                70   
4   5                77         8485  2011-02-22 00:00:00                47   

  buildUpPlaySpeedClass  buildUpPlayDribbling buildUpPlayDribblingClass  \
0              Balanced                   NaN                    Little   
1              Balanced                  48.0                    Normal   
2              Balanced                  41.0                    Normal   
3                  Fast                   NaN                    Little   
4              Balanced                   NaN                    Little   

   buildUpPlayPassing buildUpPlayPassingClass buildUpPlayPositioningClass  \
0                  50                   Mixed                   Organised   
1                  56                   Mixed                   Organised   
2                  54                   Mixed                   Organised   
3                  70                    Long                   Organised   
4                  52                   Mixed                   Organised   

   chanceCreationPassing chanceCreationPassingClass  chanceCreationCrossing  \
0                     60                     Normal                      65   
1                     54                     Normal                      63   
2                     54                     Normal                      63   
3                     70                      Risky                      70   
4                     53                     Normal                      48   

  chanceCreationCrossingClass  chanceCreationShooting  \
0                      Normal                      55   
1                      Normal                      64   
2                      Normal                      64   
3                        Lots                      70   
4                      Normal                      52   

  chanceCreationShootingClass chanceCreationPositioningClass  defencePressure  \
0                      Normal                      Organised               50   
1                      Normal                      Organised               47   
2                      Normal                      Organised               47   
3                        Lots                      Organised               60   
4                      Normal                      Organised               47   

  defencePressureClass  defenceAggression defenceAggressionClass  \
0               Medium                 55                  Press   
1               Medium                 44                  Press   
2               Medium                 44                  Press   
3               Medium                 70                 Double   
4               Medium                 47                  Press   

   defenceTeamWidth defenceTeamWidthClass defenceDefenderLineClass  
0                45                Normal                    Cover  
1                54                Normal                    Cover  
2                54                Normal                    Cover  
3                70                  Wide                    Cover  
4                52                Normal                    Cover  
********************************************************************************

Analyzing Player Table

  • Derive two auxiliary features, BMI & Age, from Player dataframe.
  • Use Distributions and Horizontal Lines plots to visualize these derived features along with two native ones, Weight & Height.
In [8]:
## Derive two auxiliary features, BMI & Age, 

DFs['dfPlayer']["weight_kg"] = DFs['dfPlayer']["weight"] * 0.453592
DFs['dfPlayer']["height_m"]  = DFs['dfPlayer']["height"] / 100

## Derive BMI:
DFs['dfPlayer']["bmi"]      = DFs['dfPlayer']["weight_kg"]/(DFs['dfPlayer']["height_m"] * DFs['dfPlayer']["height_m"])

## Derive Age from year in a player's birthday attribute:
DFs['dfPlayer']["year"]  = pd.DatetimeIndex(DFs['dfPlayer']["birthday"]).year
DFs['dfPlayer']["age"]   = ThisYear - DFs['dfPlayer']["year"]


### Normal Distribution of selected Player Attributes
cols  = ["bmi", "weight_kg", "height_m", "age"]
length = len(cols)
cplt01 = ["b","r","k","c"]  ## color palette
plt.figure(figsize=(14, 8))

for i, j, k in itertools.zip_longest(cols, range(length), cplt01): 
  plt.subplot(2,2,j+1)
  sns.distplot(DFs['dfPlayer'][i], color=k)    ## distribution plot()
  plt.axvline(DFs['dfPlayer'][i].mean(), color = "k", linestyle = "dashed", label="mean")
  plt.legend(loc="best")
  plt.title(i)
  plt.xlabel("")

## Horizontal lines diagrams on Weight, Height, BMI and Ages:
cplt02 = ["b","r","g","c"]

plt.figure(figsize=(13, 12))

for i, j, k in itertools.zip_longest(cols, range(length), cplt02):
  plt.subplot(4,1,j+1)
  DFs['dfPlayer'][i].plot(color = k, linewidth =.2, label = i)  ## plot()
  plt.axhline(DFs['dfPlayer'][i].mean(), color = "k", linestyle = "dashed", label="mean")
  plt.legend(loc="best")
  plt.subplots_adjust(hspace = .3)
  plt.xlabel("player index")
  plt.ylabel(i)
  plt.title(i)

Players Min/Max:

In [9]:
print ("Min/Max of PLAYER ATTRIBUTES")
print ("=" * 80)

### Age: Oldest & Youngest players
dfOldestPlayers = DFs['dfPlayer'].loc[DFs['dfPlayer']["age"].idxmax()]
dfYoungestPlayers = DFs['dfPlayer'].loc[ DFs['dfPlayer']["age"].idxmin() ]

print("Oldest Player is {} of age {:3.0f} years old.".format( ", ".join(
  list( DFs['dfPlayer'][ DFs['dfPlayer']["age"] == dfOldestPlayers["age"] ]["player_name"])),
                                                             dfOldestPlayers["age"]))

print("Youngest Players are {} of age {:3.0f} years old.".format( ", ".join(
  list( DFs['dfPlayer'][DFs['dfPlayer']["age"] == dfYoungestPlayers["age"] ]["player_name"])),
                                                                 dfYoungestPlayers["age"] ))


### Height: Tallest & Shortest players
dfTallestPlayers = DFs['dfPlayer'].loc[ DFs['dfPlayer']["height_m"].idxmax()]
dfShortestPlayers = DFs['dfPlayer'].loc[ DFs['dfPlayer']["height_m"].idxmin() ]

print("Tallest Player is {} of Height {:2.2f} meters.".format( ", ".join( list( DFs['dfPlayer'][
  DFs['dfPlayer']["height_m"] == dfTallestPlayers["height_m"]]["player_name"])) , dfTallestPlayers["height_m"] ))

print("Shortest Player is {} of Height {:2.2f} meters.".format( ", ".join(
  list( DFs['dfPlayer'][ DFs['dfPlayer']["height_m"] == dfShortestPlayers["height_m"]]["player_name"])),
                                                               dfShortestPlayers["height_m"] ))


### Weight: Heaviest & Lightest players
dfHeaviestPlayers = DFs['dfPlayer'].loc[DFs['dfPlayer']["weight_kg"].idxmax()]
dfLightestPlayers = DFs['dfPlayer'].loc[DFs['dfPlayer']["weight_kg"].idxmin()]

print("Player with Highest weight is {} of Weight {:3.2f} kilograms".format( ", ".join(
  list( DFs['dfPlayer'][ DFs['dfPlayer']["weight_kg"] == dfHeaviestPlayers["weight_kg"] ]["player_name"])),
                                                                            dfHeaviestPlayers["weight_kg"]))

print("Player with Lowest weight is {} of Weight {:3.2f} kilograms".format( ", ".join(
  list( DFs['dfPlayer'][ DFs['dfPlayer']["weight_kg"] == dfLightestPlayers["weight_kg"] ]["player_name"])),
                                                                           dfLightestPlayers["weight_kg"] ))


### BMI: Max & min BMI players
dfMaxBMIPlayers = DFs['dfPlayer'].loc[DFs['dfPlayer']["bmi"].idxmax()]
dfMinBMIPlayers = DFs['dfPlayer'].loc[DFs['dfPlayer']["bmi"].idxmin()]

print("Player with Highest Body Mass Index is {} of {:3.2f} kg/m2".format( ", ".join(
  list( DFs['dfPlayer'][ DFs['dfPlayer']["bmi"] == dfMaxBMIPlayers["bmi"]]["player_name"])),
                                                                          dfMaxBMIPlayers["bmi"]  ))

print("Player with Lowest Body Mass Index is {} of {:3.2f} kg/m2".format( ", ".join(
  list( DFs['dfPlayer'][ DFs['dfPlayer']["bmi"] == dfMinBMIPlayers["bmi"]]["player_name"])),
                                                                         dfMinBMIPlayers["bmi"]))
Min/Max of PLAYER ATTRIBUTES
================================================================================
Oldest Player is Alberto Fontana of age  52 years old.
Youngest Players are Alban Lafont, Gianluigi Donnarumma, Jonathan Leko of age  20 years old.
Tallest Player is Kristof van Hout of Height 2.08 meters.
Shortest Player is Juan Quero of Height 1.57 meters.
Player with Highest weight is Kristof van Hout, Tim Wiese of Weight 110.22 kilograms
Player with Lowest weight is Juan Quero of Weight 53.07 kilograms
Player with Highest Body Mass Index is Sylvan Ebanks-Blake of 30.87 kg/m2
Player with Lowest Body Mass Index is John Stewart of 17.59 kg/m2

Heatmap - Statistics on 4 Selected Player Attributes

In [10]:
## Pandas describe() reports the Statistics profile of a sample dataset including Mean, Standard Deviation, Min/Max,
## 25%, 50% and 75%. I collected a handful of player physical attributes and used Seaborn Heatmap to get an idea
## at high level.

plt.figure(figsize=(13,6))

sns.heatmap(DFs['dfPlayer'][["height", "weight", "weight_kg", "age", "bmi"]].describe()[1:].transpose(),
            annot=True, fmt="f", linecolor="white", linewidths=2)

plt.title("Four Player Attributes Summary")
plt.show()

Heatmap - Correlation Among Numerical Attributes in Player DataFrame

Correlation coefficient of two variables in a data sample is a normalized measurement of how they are linearly related. It provides a clue to tell how strong two variables are related provided the data sample is normal distribution.

Seaborn heatmap offers an intuitive way to reveal the correlation among features pairwise.

In [11]:
## Correlation HeadMap:

dfPlayerNumAttrs = getNumAttrs(DFs['dfPlayer'])
corrPlayerNumAttrs = dfPlayerNumAttrs.loc[:, dfPlayerNumAttrs.columns[3:]].corr()

plt.figure(figsize=(13,8))
sns.heatmap(corrPlayerNumAttrs, annot=True, fmt="f", linecolor="k", linewidths=2, cmap = sns.color_palette("Set2"))

plt.title("Correlation Among Numerical Attributes of Player DataFrame")
plt.show()

Scatter PairPlot Matrix - Correction Among Numerical Attributes in Players:

In [12]:
sns.pairplot( dfPlayerNumAttrs.loc[:, dfPlayerNumAttrs.columns[3:]] );

Scatter Plots on Age vs Weight, Height and BMI

The Scatterplot Matrix above showed potential correlation among Weight, Height and BMI. I'd like to look in further whether Age may have any correlation with any of these. I used Scatter plot to find out the correlation.

No apparent correclation was found.

In [13]:
### Scatter Plot for Age vs Weight, Height and BMI
cols = [ 'weight_kg', 'height_m', 'bmi']
length=len(cols)
plt.figure(figsize=(8,18))

for i, j in itertools.zip_longest(cols, range(length)):
  plt.subplot(3, 1, j+1)
  plt.scatter(DFs['dfPlayer']["age"], DFs['dfPlayer'][i], color="Orange", s= 1)

  plt.xlabel("age")
  plt.ylabel(i)
  plt.title(i)

Bar Chart to view Counts on the Initials of Players' First and Last Names:

In [14]:
### Counts on Player Name Initials:
firstIniCount = pd.DataFrame(DFs['dfPlayer']["player_name"].str.split(" ").str[0].str.upper().str[:1].value_counts())
lastIniCount = pd.DataFrame(DFs['dfPlayer']["player_name"].str.split(" ").str[1].str.upper().str[:1].value_counts())
Inis = firstIniCount.merge(lastIniCount, left_index=True, right_index=True, how="left").reset_index()
Inis = Inis.rename(columns={"index":"letter", "player_name_x":'first_name', "player_name_y":"last_name"}).sort_values(
  by="letter", ascending = True)

plt.figure(figsize=(14, 8))

plt.subplot(211)
sns.barplot(Inis["letter"], Inis["first_name"], color="b", linewidth = 1, edgecolor = "k"*26)
plt.ylabel("Count")
plt.xlabel("Alphabet")
plt.title("Value Count on FIRST NAME Initial")

plt.subplot(212)
sns.barplot(Inis["letter"], Inis["last_name"], color="r", linewidth = 1, edgecolor = "k"*26)
plt.ylabel("Count")
plt.title("Value Count on LAST NAME Initial")
plt.show()

Extract Numerical Attributes to a New DataFrame

In [15]:
dfNumPlayerAttrs = getNumAttrs(DFs['dfPlayer_Attributes'])
dfNumPlayerAttrs.head().loc[:, dfNumPlayerAttrs.columns[3:]].head()
Out[15]:
overall_rating potential crossing finishing heading_accuracy short_passing volleys dribbling curve free_kick_accuracy long_passing ball_control acceleration sprint_speed agility reactions balance shot_power jumping stamina strength long_shots aggression interceptions positioning vision penalties marking standing_tackle sliding_tackle gk_diving gk_handling gk_kicking gk_positioning gk_reflexes
0 67.0 71.0 49.0 44.0 71.0 61.0 44.0 51.0 45.0 39.0 64.0 49.0 60.0 64.0 59.0 47.0 65.0 55.0 58.0 54.0 76.0 35.0 71.0 70.0 45.0 54.0 48.0 65.0 69.0 69.0 6.0 11.0 10.0 8.0 8.0
1 67.0 71.0 49.0 44.0 71.0 61.0 44.0 51.0 45.0 39.0 64.0 49.0 60.0 64.0 59.0 47.0 65.0 55.0 58.0 54.0 76.0 35.0 71.0 70.0 45.0 54.0 48.0 65.0 69.0 69.0 6.0 11.0 10.0 8.0 8.0
2 62.0 66.0 49.0 44.0 71.0 61.0 44.0 51.0 45.0 39.0 64.0 49.0 60.0 64.0 59.0 47.0 65.0 55.0 58.0 54.0 76.0 35.0 63.0 41.0 45.0 54.0 48.0 65.0 66.0 69.0 6.0 11.0 10.0 8.0 8.0
3 61.0 65.0 48.0 43.0 70.0 60.0 43.0 50.0 44.0 38.0 63.0 48.0 60.0 64.0 59.0 46.0 65.0 54.0 58.0 54.0 76.0 34.0 62.0 40.0 44.0 53.0 47.0 62.0 63.0 66.0 5.0 10.0 9.0 7.0 7.0
4 61.0 65.0 48.0 43.0 70.0 60.0 43.0 50.0 44.0 38.0 63.0 48.0 60.0 64.0 59.0 46.0 65.0 54.0 58.0 54.0 76.0 34.0 62.0 40.0 44.0 53.0 47.0 62.0 63.0 66.0 5.0 10.0 9.0 7.0 7.0

HeatMap - Correlation among Numerical features.

We should see a positive correlation between the attacking features, a positive correlation between the defensive features and a negative correlation between the attacking and defensive features.

In [16]:
## Correlation matrix of Numerical fields on Player Attributes, excluding the First 3 ID columns:
corr2 = dfNumPlayerAttrs.loc[:, dfNumPlayerAttrs.columns[3:]].corr()

fig2,ax2 = plt.subplots(nrows = 1,ncols = 1)
fig2.set_size_inches(w=24, h=24)

sns.heatmap(corr2,annot = True, linewidths=0.5, ax = ax2);

Extract Categorical Attributes to a New DataFrame

  • A home-made function, getCatAttrs(), extracts categorical attributes in Player Attributes to a new dataframe.
  • Cardinality is a measure of uniqueness and is often applied on categorical attributes to determine whether any of them can be
  • All four cardinalities, 0.107%, 0.002%, 0.004% and 0.011%, are very small which represent these attributes have very limited distinct values. Their values share a lot in common.
In [17]:
## Find Cardinality among categorical features in Player Attributes dataframe:

## Retrieve Categorical Attributes from Player Attributes to a new DataFrame:
dfCatPlayerAttrs = getCatAttrs(DFs['dfPlayer_Attributes'])

## Show Cardinality of categotical attributes on Player Attributes dataframe:
for catattr in getColNames(dfCatPlayerAttrs):

  cardi = 100 * ( 
    len(np.unique( dfCatPlayerAttrs.loc[:, catattr].astype(str)))) / len(dfCatPlayerAttrs.loc[:, catattr].astype(str))
  
  print("Cardinality of {} - {:0.3f}%".format(catattr, cardi))
Cardinality of date - 0.107%
Cardinality of preferred_foot - 0.002%
Cardinality of attacking_work_rate - 0.004%
Cardinality of defensive_work_rate - 0.011%

Count Plots on Preferred_Foot, Attacking_Work_Rate, and Defensive_Work_Rate

Use Count plots to get an idea about three Categorical attributes, preferred_foot, attacking_work_rate, and defensive_work_rate. The intention is to figure out:

  • which columns might have abnormal attacking and defensive work rate.
  • Is there a correlation between attacking and defensive work rate values? If abnormal, how so?

The charts were followed by their corresponding statistics to streamline the comparison.

In [18]:
fig3, ax3 = plt.subplots(nrows = 1, ncols = 3)
fig3.set_size_inches(12, 3)

sns.countplot(x = DFs['dfPlayer_Attributes']['preferred_foot'], ax = ax3[0])
sns.countplot(x = DFs['dfPlayer_Attributes']['attacking_work_rate'], ax = ax3[1])
sns.countplot(x = DFs['dfPlayer_Attributes']['defensive_work_rate'], ax = ax3[2])

fig3.tight_layout()
In [19]:
print("Player Attrs Dimension {} - Records: {} & Attributes: {}\n".format(DFs['dfPlayer_Attributes'].shape,
                                                                          DFs['dfPlayer_Attributes'].shape[0],
                                                                          DFs['dfPlayer_Attributes'].shape[1]))

print("Attacking Work Rate:\n{}".format(DFs['dfPlayer_Attributes']['attacking_work_rate'].value_counts()))
print(80*'*')

print("Defensive Work Rate:\n{}".format(DFs['dfPlayer_Attributes']['defensive_work_rate'].value_counts()))
print(80*'*')
Player Attrs Dimension (183978, 42) - Records: 183978 & Attributes: 42

Attacking Work Rate:
medium    125070
high       42823
low         8569
None        3639
norm         348
y            106
le           104
stoc          89
Name: attacking_work_rate, dtype: int64
********************************************************************************
Defensive Work Rate:
medium    130846
high       27041
low        18432
_0          2394
o           1550
1            441
ormal        348
2            342
3            258
5            234
7            217
0            197
6            197
9            152
4            116
es           106
ean          104
tocky         89
8             78
Name: defensive_work_rate, dtype: int64
********************************************************************************

Attacking vs Defensive Work Rate

  • Work rates are how hard players works on each end of the field.
  • The levels of attacking and defensive work rate should be Low, Medium, and Hhigh. They account for the majority of instances.
  • The rest of levels can be filtered out.

Based on the count plots we can confirm filtering out the rest of levels has little impact on our analyses.

In [20]:
### Filter Work_Rates based on Low, Medium and High.
DFs['dfPlayer_Attributes_3WorkRates'] = DFs['dfPlayer_Attributes'].loc[
  (DFs['dfPlayer_Attributes']['attacking_work_rate'].isin(['medium','high','low']) & \
   DFs['dfPlayer_Attributes']['defensive_work_rate'].isin(['medium','high','low'])), :]

## After filtering out values we inspect three diagrams again:
fig4, ax4 = plt.subplots(nrows = 1, ncols = 3)
fig4.set_size_inches(12,3)

sns.countplot(x = DFs['dfPlayer_Attributes_3WorkRates']['preferred_foot'], ax = ax4[0])
sns.countplot(x = DFs['dfPlayer_Attributes_3WorkRates']['attacking_work_rate'], ax = ax4[1])
sns.countplot(x = DFs['dfPlayer_Attributes_3WorkRates']['defensive_work_rate'], ax = ax4[2])
fig4.tight_layout()

Count Plots on Preferred_Foot, Attacking_Work_Rate, and Defensive_Work_Rate

In [21]:
fig4, ax4 = plt.subplots(nrows = 1, ncols = 3)
fig4.set_size_inches(12,3)

sns.barplot(x ='preferred_foot', y = 'preferred_foot', data = DFs['dfPlayer_Attributes_3WorkRates'],\
            estimator = lambda x: len(x)/len(DFs['dfPlayer_Attributes_3WorkRates']) * 100, ax = ax4[0], orient = 'v')

ax4[0].set(ylabel = 'Percentage',title = 'Preferred Foot')
sns.barplot(x ='attacking_work_rate', y = 'attacking_work_rate', data = DFs['dfPlayer_Attributes_3WorkRates'],\
            estimator = lambda x: len(x)/len(DFs['dfPlayer_Attributes_3WorkRates']) * 100, ax = ax4[1], orient = 'v')

ax4[1].set(ylabel = 'Percentage',title = 'Attacking Work Rate')
sns.barplot(x ='defensive_work_rate', y = 'defensive_work_rate', data = DFs['dfPlayer_Attributes_3WorkRates'],\
            estimator = lambda x: len(x)/len(DFs['dfPlayer_Attributes_3WorkRates']) * 100, ax = ax4[2], orient = 'v')

ax4[2].set(ylabel = 'Percentage', title = 'Defensive Work Rate')

fig4.tight_layout()
In [22]:
att_work_rate = DFs['dfPlayer_Attributes_3WorkRates'].groupby('attacking_work_rate').size().values.tolist()
def_work_rate = DFs['dfPlayer_Attributes_3WorkRates'].groupby('defensive_work_rate').size().values.tolist()

print("Attacking work rate factor, Medium, accounts for: {:0.3f}% of features".format(
  100 * att_work_rate[2]/np.sum(att_work_rate)))

print("Defensive work rate factor, Medium, accounts for: {:0.3f}% of features".format(
  100 * def_work_rate[2]/np.sum(def_work_rate)))
Attacking work rate factor, Medium, accounts for: 70.885% of features
Defensive work rate factor, Medium, accounts for: 74.230% of features

Percentage of rows eliminated due to invalid/strange attacking and defensive work rate values:

In [23]:
print("Percentage of instances removed from player attributes table: {:0.2f}%".format(
  100* (1 - DFs['dfPlayer_Attributes_3WorkRates'].shape[0] / DFs['dfPlayer_Attributes'].shape[0])))

print("We removed {} instances from Player Attributes table".format(
  -DFs['dfPlayer_Attributes_3WorkRates'].shape[0] + DFs['dfPlayer_Attributes'].shape[0]))
Percentage of instances removed from player attributes table: 4.25%
We removed 7817 instances from Player Attributes table

Use Distribution plots to compare numerical features in Player Attributes against preferred_foot

Analyze distribution and spead of continuous features based off of categorical features. Do the levels of each categorical features drastically change the distribution / spead of the continuous features?

Preferred Foot does not distinguish any of the variables in boxplots.

In [24]:
pat = DFs['dfPlayer_Attributes_3WorkRates'].loc[:, DFs['dfPlayer_Attributes_3WorkRates'].columns.tolist()[3:]]

fig5, ax5 = plt.subplots(nrows=5,ncols=7)
fig5.set_size_inches(16,12)

for i, j in enumerate(
  DFs['dfPlayer_Attributes_3WorkRates'].select_dtypes(include = ['float64','int64']).columns[3:].tolist()):
    
  sns.distplot(pat.loc[:,j], kde = True, hist = True, ax = ax5[int(i/7)][i%7])

fig5.tight_layout()

Use boxplots to compare numerical features in Player Attributes against preferred_foot

Preferred Foot does not distinguish any of the variables in boxplots.

In [25]:
fig6, ax6 = plt.subplots(nrows=5, ncols=7)
fig6.set_size_inches(16,12)

for i,j in enumerate(
  DFs['dfPlayer_Attributes_3WorkRates'].select_dtypes(include = ['float64','int64']).columns[3:].tolist()):
  
  sns.boxplot(x = "preferred_foot", y = j, data= pat, ax = ax6[int(i/7)][i%7])
  
fig6.tight_layout()

Distribution of features the same regardless of preferred foot

In [26]:
fig7, ax7 = plt.subplots(nrows=5,ncols=7)
fig7.set_size_inches(16,12)
for i,j in enumerate(
  DFs['dfPlayer_Attributes_3WorkRates'].select_dtypes(include = ['float64','int64']).columns[3:].tolist()):
  
  sns.boxplot(x = "attacking_work_rate", y = j, data= pat, ax = ax7[int(i/7)][i%7])

fig7.tight_layout() 

Compared with "preferred foot", "attacking work rate" is a better criteria to separate the features to some extent at either high or low attacking work rate. Yet, be aware that the Medium attacking work rate takes up to 70% of the instances. For features related to atacking attributes, attacking work rate feature does a good job of distinguishing instances of high and low categorical values.

Similar to attacking work rate, "defensive work rate" is a decent criteria to differentiate defensive attributes than "preferred foot" at high and low defensive work rate values.

In [27]:
## View Relationships between Defensive Work Rate and all other Numerical Attributes on Updated Player Attributes DataFrame:

fig8, ax8 = plt.subplots(nrows=5,ncols=7)
fig8.set_size_inches(16,12)

for i,j in enumerate(
  DFs['dfPlayer_Attributes_3WorkRates'].select_dtypes(include = ['float64','int64']).columns[3:].tolist()):
  
  sns.boxplot(x = "defensive_work_rate", y = j, data= pat, ax = ax8[int(i/7)][i%7])

fig8.tight_layout()
In [28]:
## Filter out entries with empty team_fifa_api_id in Team DataFrame:
DFs['dfTeam_Updated'] = DFs['dfTeam'][ ~DFs['dfTeam'].loc[:,'team_fifa_api_id'].isna()]

## Analyzing the updated Team dataframe:
viewDF(DFs['dfTeam_Updated'], *attrs)

print("\nDescribe Categorical Attributes: {}".format(descCatAttrs(DFs['dfTeam_Updated'])))
print("\n{} Attributes with Missing Values: {}".format(tbl, showAttrsWMissingValues(DFs['dfTeam_Updated'])))
print("\n{} Members with Missing Values: {}".format(tbl, showEntitiesWMissingValues(DFs['dfTeam_Updated'])))


print("Cat Attributes Uniqueness: {}".format(
  DFs['dfTeam_Updated'].select_dtypes(exclude=['float64','int64']).apply(lambda x: len(x.unique().tolist()), axis=0)))

print("Team Long Name Uniqueness: {};\tTeam Long Name Uniqueness: {}".format(
  len(DFs['dfTeam_Updated']['team_long_name'].unique().tolist()),
  len(DFs['dfTeam_Updated']['team_short_name'].unique().tolist())))
 DataFrame shape: (288, 5)
 DataFrame columns: ['id', 'team_api_id', 'team_fifa_api_id', 'team_long_name', 'team_short_name']

info():
<class 'pandas.core.frame.DataFrame'>
Int64Index: 288 entries, 0 to 297
Data columns (total 5 columns):
id                  288 non-null int64
team_api_id         288 non-null int64
team_fifa_api_id    288 non-null float64
team_long_name      288 non-null object
team_short_name     288 non-null object
dtypes: float64(1), int64(2), object(2)
memory usage: 13.5+ KB

describe():                  id    team_api_id  team_fifa_api_id
count    288.000000     288.000000        288.000000
mean   23481.666667   11317.354167      21534.305556
std    15102.715862   22228.138190      42456.439408
min        1.000000    1601.000000          1.000000
25%     9551.750000    8370.750000        178.750000
50%    21288.500000    8658.500000        673.500000
75%    35770.250000    9885.750000       1910.750000
max    50204.000000  274581.000000     112513.000000

Describe Categorical Attributes:        team_long_name team_short_name
count             288             288
unique            285             251
top     Polonia Bytom             MON
freq                2               3

Team_Attributes Attributes with Missing Values: id                  0
team_api_id         0
team_fifa_api_id    0
team_long_name      0
team_short_name     0
dtype: int64

Team_Attributes Members with Missing Values: 0      0
1      0
2      0
3      0
4      0
5      0
6      0
7      0
9      0
10     0
11     0
12     0
13     0
15     0
16     0
17     0
18     0
19     0
20     0
21     0
22     0
23     0
24     0
25     0
26     0
27     0
28     0
29     0
30     0
31     0
32     0
33     0
34     0
35     0
36     0
37     0
38     0
39     0
40     0
41     0
42     0
43     0
44     0
45     0
46     0
47     0
48     0
49     0
50     0
51     0
52     0
53     0
54     0
55     0
56     0
57     0
58     0
59     0
60     0
61     0
62     0
63     0
64     0
65     0
66     0
67     0
68     0
69     0
70     0
71     0
72     0
73     0
74     0
75     0
76     0
77     0
78     0
79     0
80     0
81     0
82     0
83     0
84     0
85     0
86     0
87     0
88     0
89     0
90     0
91     0
92     0
93     0
94     0
95     0
96     0
97     0
98     0
99     0
100    0
101    0
102    0
103    0
104    0
105    0
106    0
107    0
108    0
109    0
110    0
111    0
112    0
113    0
114    0
115    0
116    0
117    0
118    0
119    0
120    0
121    0
122    0
123    0
124    0
125    0
126    0
127    0
128    0
129    0
130    0
131    0
132    0
133    0
134    0
135    0
136    0
137    0
138    0
139    0
140    0
141    0
142    0
143    0
144    0
145    0
146    0
147    0
148    0
149    0
150    0
151    0
152    0
153    0
154    0
155    0
156    0
157    0
158    0
159    0
160    0
161    0
162    0
163    0
164    0
165    0
166    0
167    0
168    0
169    0
171    0
172    0
173    0
174    0
175    0
176    0
177    0
178    0
179    0
180    0
181    0
182    0
183    0
184    0
185    0
186    0
187    0
188    0
189    0
190    0
191    0
192    0
193    0
194    0
195    0
196    0
197    0
198    0
199    0
200    0
201    0
202    0
203    0
205    0
206    0
207    0
209    0
210    0
211    0
212    0
214    0
215    0
216    0
217    0
218    0
219    0
220    0
221    0
222    0
224    0
226    0
227    0
228    0
229    0
230    0
231    0
234    0
235    0
236    0
237    0
238    0
239    0
240    0
241    0
242    0
243    0
244    0
245    0
246    0
247    0
248    0
249    0
250    0
251    0
252    0
253    0
254    0
255    0
256    0
257    0
258    0
259    0
260    0
261    0
262    0
263    0
264    0
265    0
266    0
267    0
268    0
269    0
270    0
271    0
272    0
273    0
274    0
275    0
276    0
277    0
278    0
279    0
280    0
281    0
282    0
283    0
284    0
285    0
286    0
287    0
288    0
289    0
290    0
291    0
292    0
293    0
294    0
295    0
296    0
297    0
dtype: int64
Cat Attributes Uniqueness: team_long_name     285
team_short_name    251
dtype: int64
Team Long Name Uniqueness: 285;	Team Long Name Uniqueness: 251

Analyzing Team Attributes Table

In Team Attributes dataframe, "buildUpPlayDribbling" is the only field with missing values. It doesn't seem to be related to the other featues and more than 50% of the features are missing. To avoid to skew data, drop the "buildUpPlayDribbling" feaure from dataset and continue analyzing the rest.

Perform the following tasks:

  • Use describe() to view categorical attributes on the updated Team Attributes dataframe.
  • Show attributes with missing values in the updated dataframe.
  • Show any members with missing fields
  • Check Uniqueness on dataframe
In [29]:
## Drop out buildUpPlayDribbling attribute 
DFs['dfTeam_Attributes_Updated'] = DFs['dfTeam_Attributes'].drop(['buildUpPlayDribbling'], axis = 1)

## Describe Categorical Attributes
print("\nDescribe Categorical attributes: {}".format(descCatAttrs(DFs['dfTeam_Attributes_Updated'])))

print("\n{} Attributes with Missing Values: {}".format(
  tbl, showAttrsWMissingValues(DFs['dfTeam_Attributes_Updated'])))

print("\n{} Members with Missing Values: {}".format(
  tbl, showEntitiesWMissingValues(DFs['dfTeam_Attributes_Updated'])))

## Uniqueness
print("\nUnique Values in Categorical Attributes: {}".format(
  DFs['dfTeam_Attributes_Updated'].select_dtypes(exclude=['float64','int64']).apply(
    lambda x: len(x.unique().tolist()), axis = 0)))
Describe Categorical attributes:                        date buildUpPlaySpeedClass buildUpPlayDribblingClass  \
count                  1458                  1458                      1458   
unique                    6                     3                         3   
top     2015-09-10 00:00:00              Balanced                    Little   
freq                    245                  1184                      1004   

       buildUpPlayPassingClass buildUpPlayPositioningClass  \
count                     1458                        1458   
unique                       3                           2   
top                      Mixed                   Organised   
freq                      1236                        1386   

       chanceCreationPassingClass chanceCreationCrossingClass  \
count                        1458                        1458   
unique                          3                           3   
top                        Normal                      Normal   
freq                         1231                        1195   

       chanceCreationShootingClass chanceCreationPositioningClass  \
count                         1458                           1458   
unique                           3                              2   
top                         Normal                      Organised   
freq                          1224                           1309   

       defencePressureClass defenceAggressionClass defenceTeamWidthClass  \
count                  1458                   1458                  1458   
unique                    3                      3                     3   
top                  Medium                  Press                Normal   
freq                   1243                   1274                  1286   

       defenceDefenderLineClass  
count                      1458  
unique                        2  
top                       Cover  
freq                       1362  

Team_Attributes Attributes with Missing Values: id                                0
team_fifa_api_id                  0
team_api_id                       0
date                              0
buildUpPlaySpeed                  0
buildUpPlaySpeedClass             0
buildUpPlayDribblingClass         0
buildUpPlayPassing                0
buildUpPlayPassingClass           0
buildUpPlayPositioningClass       0
chanceCreationPassing             0
chanceCreationPassingClass        0
chanceCreationCrossing            0
chanceCreationCrossingClass       0
chanceCreationShooting            0
chanceCreationShootingClass       0
chanceCreationPositioningClass    0
defencePressure                   0
defencePressureClass              0
defenceAggression                 0
defenceAggressionClass            0
defenceTeamWidth                  0
defenceTeamWidthClass             0
defenceDefenderLineClass          0
dtype: int64

Team_Attributes Members with Missing Values: 0       0
1       0
2       0
3       0
4       0
5       0
6       0
7       0
8       0
9       0
10      0
11      0
12      0
13      0
14      0
15      0
16      0
17      0
18      0
19      0
20      0
21      0
22      0
23      0
24      0
25      0
26      0
27      0
28      0
29      0
30      0
31      0
32      0
33      0
34      0
35      0
36      0
37      0
38      0
39      0
40      0
41      0
42      0
43      0
44      0
45      0
46      0
47      0
48      0
49      0
50      0
51      0
52      0
53      0
54      0
55      0
56      0
57      0
58      0
59      0
60      0
61      0
62      0
63      0
64      0
65      0
66      0
67      0
68      0
69      0
70      0
71      0
72      0
73      0
74      0
75      0
76      0
77      0
78      0
79      0
80      0
81      0
82      0
83      0
84      0
85      0
86      0
87      0
88      0
89      0
90      0
91      0
92      0
93      0
94      0
95      0
96      0
97      0
98      0
99      0
100     0
101     0
102     0
103     0
104     0
105     0
106     0
107     0
108     0
109     0
110     0
111     0
112     0
113     0
114     0
115     0
116     0
117     0
118     0
119     0
120     0
121     0
122     0
123     0
124     0
125     0
126     0
127     0
128     0
129     0
130     0
131     0
132     0
133     0
134     0
135     0
136     0
137     0
138     0
139     0
140     0
141     0
142     0
143     0
144     0
145     0
146     0
147     0
148     0
149     0
150     0
151     0
152     0
153     0
154     0
155     0
156     0
157     0
158     0
159     0
160     0
161     0
162     0
163     0
164     0
165     0
166     0
167     0
168     0
169     0
170     0
171     0
172     0
173     0
174     0
175     0
176     0
177     0
178     0
179     0
180     0
181     0
182     0
183     0
184     0
185     0
186     0
187     0
188     0
189     0
190     0
191     0
192     0
193     0
194     0
195     0
196     0
197     0
198     0
199     0
200     0
201     0
202     0
203     0
204     0
205     0
206     0
207     0
208     0
209     0
210     0
211     0
212     0
213     0
214     0
215     0
216     0
217     0
218     0
219     0
220     0
221     0
222     0
223     0
224     0
225     0
226     0
227     0
228     0
229     0
230     0
231     0
232     0
233     0
234     0
235     0
236     0
237     0
238     0
239     0
240     0
241     0
242     0
243     0
244     0
245     0
246     0
247     0
248     0
249     0
       ..
1208    0
1209    0
1210    0
1211    0
1212    0
1213    0
1214    0
1215    0
1216    0
1217    0
1218    0
1219    0
1220    0
1221    0
1222    0
1223    0
1224    0
1225    0
1226    0
1227    0
1228    0
1229    0
1230    0
1231    0
1232    0
1233    0
1234    0
1235    0
1236    0
1237    0
1238    0
1239    0
1240    0
1241    0
1242    0
1243    0
1244    0
1245    0
1246    0
1247    0
1248    0
1249    0
1250    0
1251    0
1252    0
1253    0
1254    0
1255    0
1256    0
1257    0
1258    0
1259    0
1260    0
1261    0
1262    0
1263    0
1264    0
1265    0
1266    0
1267    0
1268    0
1269    0
1270    0
1271    0
1272    0
1273    0
1274    0
1275    0
1276    0
1277    0
1278    0
1279    0
1280    0
1281    0
1282    0
1283    0
1284    0
1285    0
1286    0
1287    0
1288    0
1289    0
1290    0
1291    0
1292    0
1293    0
1294    0
1295    0
1296    0
1297    0
1298    0
1299    0
1300    0
1301    0
1302    0
1303    0
1304    0
1305    0
1306    0
1307    0
1308    0
1309    0
1310    0
1311    0
1312    0
1313    0
1314    0
1315    0
1316    0
1317    0
1318    0
1319    0
1320    0
1321    0
1322    0
1323    0
1324    0
1325    0
1326    0
1327    0
1328    0
1329    0
1330    0
1331    0
1332    0
1333    0
1334    0
1335    0
1336    0
1337    0
1338    0
1339    0
1340    0
1341    0
1342    0
1343    0
1344    0
1345    0
1346    0
1347    0
1348    0
1349    0
1350    0
1351    0
1352    0
1353    0
1354    0
1355    0
1356    0
1357    0
1358    0
1359    0
1360    0
1361    0
1362    0
1363    0
1364    0
1365    0
1366    0
1367    0
1368    0
1369    0
1370    0
1371    0
1372    0
1373    0
1374    0
1375    0
1376    0
1377    0
1378    0
1379    0
1380    0
1381    0
1382    0
1383    0
1384    0
1385    0
1386    0
1387    0
1388    0
1389    0
1390    0
1391    0
1392    0
1393    0
1394    0
1395    0
1396    0
1397    0
1398    0
1399    0
1400    0
1401    0
1402    0
1403    0
1404    0
1405    0
1406    0
1407    0
1408    0
1409    0
1410    0
1411    0
1412    0
1413    0
1414    0
1415    0
1416    0
1417    0
1418    0
1419    0
1420    0
1421    0
1422    0
1423    0
1424    0
1425    0
1426    0
1427    0
1428    0
1429    0
1430    0
1431    0
1432    0
1433    0
1434    0
1435    0
1436    0
1437    0
1438    0
1439    0
1440    0
1441    0
1442    0
1443    0
1444    0
1445    0
1446    0
1447    0
1448    0
1449    0
1450    0
1451    0
1452    0
1453    0
1454    0
1455    0
1456    0
1457    0
Length: 1458, dtype: int64

Unique Values in Categorical Attributes: date                              6
buildUpPlaySpeedClass             3
buildUpPlayDribblingClass         3
buildUpPlayPassingClass           3
buildUpPlayPositioningClass       2
chanceCreationPassingClass        3
chanceCreationCrossingClass       3
chanceCreationShootingClass       3
chanceCreationPositioningClass    2
defencePressureClass              3
defenceAggressionClass            3
defenceTeamWidthClass             3
defenceDefenderLineClass          2
dtype: int64

View Distribution on Numerical Team Attributes

In [30]:
tat = DFs['dfTeam_Attributes_Updated'].loc[:, DFs['dfTeam_Attributes_Updated'].columns.tolist()[3:]]

sns.pairplot(tat);
In [31]:
fig9, ax9 = plt.subplots(nrows=2,ncols=4)
fig9.set_size_inches(12,6)
for i,j in enumerate(DFs['dfTeam_Attributes_Updated'].select_dtypes(include = ['int64']).columns[3:].tolist()):
    sns.distplot(tat.loc[:,j],kde =True,hist = True, ax = ax9[int(i/4)][i%4])
fig9.tight_layout()

View Boxplot on Numerical Team Attributes

In [32]:
sns.boxplot(data = DFs['dfTeam_Attributes_Updated'].select_dtypes(include = ['int64']).iloc[:,3:], orient = 'h');

View Bar Plots on Categorical Team Attributes

In [33]:
### List 8 Numerical Team Attributes:
print(tat.select_dtypes(include = ["int64"]).columns.tolist())

fig9, ax9 = plt.subplots(nrows=3, ncols=4)
fig9.set_size_inches(14,8)

for i,j in enumerate(DFs['dfTeam_Attributes_Updated'].select_dtypes(include = ['object']).columns[1:].tolist()):
  sns.barplot(x = j, y = j, data = tat,
              estimator = lambda x: len(x)/len(tat) * 100, ax = ax9[int(i/4)][i%4],
              orient = 'v')
  ax9[int(i/4)][i%4].set(xlabel = "")

fig9.tight_layout()
['buildUpPlaySpeed', 'buildUpPlayPassing', 'chanceCreationPassing', 'chanceCreationCrossing', 'chanceCreationShooting', 'defencePressure', 'defenceAggression', 'defenceTeamWidth']

Exploratory Data Analysis

I used Pandas DataFrame merge() to combine data in different dataframes and present analyses in diagrams.

Among seven dataframes,

  • Players and Teams are the targets to understand. Both combined with their corresponding attributes dataframes can provide certain insight in their domain;
  • Matches are records of archievement. To compare or evaluate how well Players, Teams and Leagues performed we blend in Matches and make prediction.
  • Countries and Leagues are the base for grouping. They serve as scopes for corresponding analyses.

Exploratory analyses are where I present merged data from different dataframes using visualization techniques.

In [34]:
## merge player data with player attributes
DFs['dfPlayers'] = DFs['dfPlayer_Attributes'].merge(DFs['dfPlayer'],
                                                    left_on="player_api_id",
                                                    right_on="player_api_id",
                                                    how="left")

ignoredPlayerAttrs =["id_x", "id_y", 'player_fifa_api_id_y', 'height', 'weight',
                     'weight_kg', 'height_m', 'bmi', 'year', 'age', 'birthday']

DFs['dfPlayers_WithSelectedAttrs'] = DFs['dfPlayers'][
                                        [x for x in DFs['dfPlayers'].columns if x not in ignoredPlayerAttrs]]

DFs['dfPlayers_WithSelectedAttrs']["date"] = pd.to_datetime(DFs['dfPlayers_WithSelectedAttrs']["date"],
                                                            format="%Y-%m-%d")

###  Players Who Played in Most Games
ax = DFs['dfPlayers_WithSelectedAttrs']["player_name"].value_counts().sort_values()[-20:].plot(
  kind="barh", figsize=(10,10), color="b",width=.9, linewidth = 1, edgecolor = "k"*20)

for i,j in enumerate(DFs['dfPlayers_WithSelectedAttrs']["player_name"].value_counts().sort_values()[-20:].values):
  ax.text(.7, i, j, weight = "bold", color="white")
ax.set_title("Players Who Played Highest Matches")
plt.show()

Does Player's "Overall Rating" depend on any Player Attribute(s)?

Scatter plots reveal the correlation among attributes in a dataset.

Here we see scatter plots between "overall rating" and a list of selected player attributes. Values on "potential" and "reactions" attributes appear to have correlation with "overall rating".

In [35]:
### scatter plot for overall rating and player attributes
playerAttrs = ["player_api_id",'overall_rating',
       'potential','crossing', 'finishing', 'heading_accuracy',
       'short_passing', 'volleys', 'dribbling', 'curve', 'free_kick_accuracy',
       'long_passing', 'ball_control', 'acceleration', 'sprint_speed',
       'agility', 'reactions', 'balance', 'shot_power', 'jumping', 'stamina',
       'strength', 'long_shots', 'aggression', 'interceptions', 'positioning',
       'vision', 'penalties', 'marking', 'standing_tackle', 'sliding_tackle',
       'gk_diving', 'gk_handling', 'gk_kicking', 'gk_positioning', 'gk_reflexes']

x = DFs['dfPlayer_Attributes'][playerAttrs]
x = x.groupby("player_api_id")[playerAttrs].mean()
x = x.drop("player_api_id", axis =1).reset_index().drop("player_api_id",axis=1)

selectedPlayerAttrs = ['potential','crossing', 'finishing', 'heading_accuracy',
       'short_passing', 'volleys', 'dribbling', 'curve', 'free_kick_accuracy',
       'long_passing', 'ball_control', 'acceleration', 'sprint_speed',
       'agility', 'reactions', 'balance', 'shot_power', 'jumping', 'stamina',
       'strength', 'long_shots', 'aggression', 'interceptions', 'positioning',
       'vision', 'penalties', 'marking', 'standing_tackle', 'sliding_tackle',
       'gk_diving', 'gk_handling', 'gk_kicking', 'gk_positioning', 'gk_reflexes' ]

length = len(selectedPlayerAttrs)
plt.figure(figsize=(15,15))

for i,j in itertools.zip_longest(selectedPlayerAttrs, range(length)):
  plt.subplot(5,7,j+1)
  plt.scatter(x["overall_rating"], x[i], s=.01, color="Orange")
  plt.title(i)
  plt.subplots_adjust(hspace =.4)

Comparing Skills Among Players with Top Overall_Rating?

What are the similarities and differences in terms of skills among Top Overall_Rating Players?

Radar charts are an effective tool to visualize the differences among selected attributes. Not only can they be used to identify strengths and weaknesses of a subject radar charts show performance gaps when they're used to compare the same set of attributes among subjects. In this section I used radar charts to compare 30 selected attritbues based on top players and a specific criteria. In addition to identify the performance gaps and strengths and weaknesses we can also use them to see possible paradigm shift when we compare the data in different time.

In this section, I defined the Top-Player criteria as "overall_rating greater than 91.0". It returned 6 players. Each top-rated player's radar charts consisted of 30 attributes. Eight radar charts were present below.

  • Cristiano Ronaldo,
  • Gianluigi Buffon,
  • Gregory Coupet,
  • Lionel Messi,
  • Wayne Rooney,
  • Xavi Hernandez

If my Top-Player criteria changed to "ball_control greater than 94.0" I got 8 players instead. Follow the same practice. With the change on top-player criteria we can gain an understanding of our players

Based on the "overall_rating greater than 91.0" criteria we have 6 "top players". My observations are,

  • All selected players got high scores on their "potentials" and "reactions" as suggested in scatter matrix.
  • Three patterns of radar charts. It turns out the difference on their skill spectrum has to do with their positions in their teams:
    • Gianluigi Buffon and Gregory Coupet are in "GoalKeeper" position;
    • Xavi Hernandez, "Midfielder";
    • And, C Ronaldo, L Messi and W Rooney are Forward.

Since goal keeper related attributes were not selected Buffon and Coupet radar charts appeared "weak" while their overall ratings were high. Compared the only Midfielder with the rest of three Forwards, we saw:

  • Hernandez was strong on "sliding tackle", "standing tackle", and "interception" which are essential skills for a great Midfielder.
  • While three Forwards, C Ronaldo, L Messi, and W Rooney, possesed comprehensive skills they shared the same pattern of getting relatively low score on "sliding tackle", "standing tackle", and "interception".

Positions influence radar charts.

  • Forwards and Midfielders shared a lot in common and were strong in most areas, such as Ball Control, Dribbling and Balance, etc.
  • Midfielders usually scored higher in Tackles and Interception compared with Forwards. Andres Iniesta, another great Midfielder, possesed the same character as Hernandez'.
  • Forwards were strong on Acceleration, Finishing, Agility, Reactions, and Shot Power, etc.
In [36]:
### Radar Chart for Top Players and their Attributes

keyAttr, mgnAttr = 'overall_rating', 91.0     ## Selected Attribute, Margin of Attribute
## keyAttr, mgnAttr = 'ball_control', 94.0

topRatedPlayers = DFs['dfPlayers_WithSelectedAttrs'][getattr(DFs['dfPlayers_WithSelectedAttrs'], keyAttr) > mgnAttr ]

topRatedPlayers = topRatedPlayers['player_name'].unique()
numTopRatedPlayers = len(topRatedPlayers)
print("Top-{} players in terms of {} over {}:\n{}".format(numTopRatedPlayers,
                                                          keyAttr,
                                                          mgnAttr,
                                                          ",\n".join(topRatedPlayers)))

num_l = [i for i in range(numTopRatedPlayers)]

play   = DFs['dfPlayers_WithSelectedAttrs'][ DFs['dfPlayers_WithSelectedAttrs']["player_name"].isin(topRatedPlayers)]

selectedPlayerAttrs   = ["player_name", 'overall_rating','potential', 'crossing', 'finishing', 'heading_accuracy',
       'short_passing', 'volleys', 'dribbling', 'curve', 'free_kick_accuracy',
       'long_passing', 'ball_control', 'acceleration', 'sprint_speed',
       'agility', 'reactions', 'balance', 'shot_power', 'jumping', 'stamina',
       'strength', 'long_shots', 'aggression', 'interceptions', 'positioning',
       'vision', 'penalties', 'standing_tackle', 'sliding_tackle']

play = play[selectedPlayerAttrs]
play = play.groupby("player_name")[selectedPlayerAttrs].mean().reset_index()

cplt       = sns.color_palette("Set1", numTopRatedPlayers)

plt.figure(figsize=(13,23))

col = 2
row = int(numTopRatedPlayers / col)
if numTopRatedPlayers > row * col:
  row += 1 

for i, j, k in itertools.zip_longest(num_l, range(numTopRatedPlayers), cplt):

  plt.subplot( row , col , j + 1, projection="polar")
  cats = list(play)[1:]
  N    = len(cats)  
    
  values = play.loc[i].drop("player_name").values.flatten().tolist()
  values += values[:1]
  values
  angles = [n / float(N) * 2 * pi for n in range(N)]
  angles += angles[:1]
  angles
    
  plt.xticks(angles[:-1],cats,color="k",size=7)
  plt.ylim([0,100])
  plt.plot(angles, values, color=k, linewidth=2, linestyle="solid")
  plt.fill(angles, values, color=k, alpha=0.5)
  plt.title(play["player_name"][i], color="r")
  plt.subplots_adjust(wspace=.4, hspace=.4)
Top-6 players in terms of overall_rating over 91.0:
Cristiano Ronaldo,
Gianluigi Buffon,
Gregory Coupet,
Lionel Messi,
Wayne Rooney,
Xavi Hernandez

Compare Top "Overall Rating" Players Using Bar Charts

Use Bar charts to visualize and compare the skills among top overall_rating players.

In [37]:
### TOP Overall Rating PLAYERS STATS (rating > 91)
top_rated = DFs['dfPlayers_WithSelectedAttrs'][ DFs['dfPlayers_WithSelectedAttrs']["overall_rating"]  > 91 ]
top_rated = top_rated[['player_name', 'player_api_id', 'date', 'overall_rating', 'potential','finishing',
                       'acceleration', 'ball_control', 'penalties']]
top_rated = top_rated.groupby("player_name").agg({'overall_rating':"mean",
                                                  'potential':"mean",
                                                  'finishing':"mean",
                                                  'acceleration':"mean",
                                                  'ball_control':"mean" ,
                                                  'penalties':"mean"})

top_rated.plot(kind="bar", width=.6, figsize=(15,5), colors=["r","b","lime","gold","c","m"],
               alpha=.5, linewidth = 1,edgecolor = "k"*6)

plt.title("Top Player Stats")
plt.xticks(rotation = 0)
plt.legend(loc ="Lower Center")
plt.show()

Word Cloud of High Overall Rating Players

Create a Word Cloud for players with Overall Rating beyond 88

Although a Word Cloud doesn't carry much statistics info it is an intuitive way to present the frequency of all words in a selected text. As shown below, top rated players, such as L. Messe, C. Ronaldo, F. Ribery, and L. Suarez, etc., stand out from other names. I also printed out a list of tuples with their names and the counts for verification purpose.

In [38]:
## Top Players
topRatedPlayers_l =  DFs['dfPlayers_WithSelectedAttrs'][DFs['dfPlayers_WithSelectedAttrs']["overall_rating"]  > 88 ]["player_name"].value_counts()

## (Player_Name, Won_Count) :
tplTopRatedPlayerNamesTimes = [ (i, topRatedPlayers_l[i]) for i in topRatedPlayers_l.index ]
print("Rating of Top Players:\n {}".format(tplTopRatedPlayerNamesTimes))

txtTopRatedPlayerNames = []
for i in topRatedPlayers_l.index:
  x = i.split()
  txtTopRatedPlayerNames.extend( ["_".join(x) if len(x) > 1 else x[0]] )

img = np.array(Image.open("input/picture-wrd/circle.jpg"))
               
wc = WordCloud(background_color="black",
               scale=3,
               mask=img,
               colormap="prism",
               max_words=100000).generate(" ".join(txtTopRatedPlayerNames))

fig = plt.figure(figsize=(10,10))
plt.imshow(wc, interpolation="bilinear")
plt.axis("off")
plt.title("Word Cloud of Top Rated Players")
plt.show()
Rating of Top Players:
 [('Lionel Messi', 24), ('Cristiano Ronaldo', 24), ('Andres Iniesta', 15), ('Franck Ribery', 12), ('Zlatan Ibrahimovic', 11), ('Luis Suarez', 11), ('Robin van Persie', 11), ('Iker Casillas', 9), ('Manuel Neuer', 9), ('Arjen Robben', 8), ('Xavi Hernandez', 7), ('Wayne Rooney', 6), ('Radamel Falcao', 6), ('Gianluigi Buffon', 4), ('Neymar', 4), ('Nemanja Vidic', 3), ('Kaka', 3), ('Petr Cech', 2), ('Cris', 2), ('Ze Roberto', 2), ('Eden Hazard', 2), ("Samuel Eto'o", 2), ('Ronaldinho', 2), ('David Villa', 2), ('Alessandro Nesta', 2), ('John Terry', 2), ('Thierry Henry', 2), ('Julio Cesar', 2), ('David Trezeguet', 1), ('Carles Puyol', 1), ('Gregory Coupet', 1), ('Adriano', 1), ('Francesco Totti', 1), ('Frank Lampard', 1), ('Steven Gerrard', 1), ('Cesc Fabregas', 1), ('Fabio Cannavaro', 1), ('Luca Toni', 1), ('Lucio', 1)]

Preferred Foot and Top Overall Rating Players

Does the proportion of preferred foot change between all players and top players with overall rating higher than 80?

In [39]:
### Preferred Foot By Players
plt.figure(figsize=(12,6))

mgnTopOverallRating = 80

plt.subplot(121)
pf = DFs['dfPlayers_WithSelectedAttrs'].groupby(
  ["player_api_id", "preferred_foot"])["overall_rating"].mean().reset_index()["preferred_foot"].value_counts()

pf.plot.pie(autopct = "%1.0f%%", shadow = True, wedgeprops={"linewidth":2,"edgecolor":"white"},
            colors=["grey","r"], explode=[0,.1],startangle=45)

plt.title("Preferred Foot By All Players")
plt.ylabel("")

plt.subplot(122)
t_f = DFs['dfPlayers_WithSelectedAttrs'].groupby(
  ["player_api_id","preferred_foot"])["overall_rating"].mean().reset_index()

t_f[t_f["overall_rating"] > mgnTopOverallRating]["preferred_foot"].value_counts().plot.pie(
  autopct = "%1.0f%%", shadow = True, wedgeprops={"linewidth":2, "edgecolor":"white"}, colors=["grey","r"],
  explode=[0,.1], startangle=45)

plt.title("Preferred Foot By Players with Overall Rating > {}".format(mgnTopOverallRating))
plt.ylabel("")
plt.show()
In [40]:
### Preferred Foot By Attackers vs Defenders

x = DFs['dfPlayers_WithSelectedAttrs'][ DFs['dfPlayers_WithSelectedAttrs']["attacking_work_rate"] == "low"]
x = x.groupby(["player_api_id","player_name","preferred_foot"])["date"].count().reset_index()

plt.figure(figsize=(12,6))

plt.subplot(121)
x["preferred_foot"].value_counts().plot.pie(autopct = "%1.0f%%", shadow = True,
                                            wedgeprops={"linewidth":2,"edgecolor":"white"},
                                            colors=["grey","orange"],explode=[0,.1],startangle=50)

plt.ylabel("")
plt.title("Preferred foot by players with High attacking rate")

x = DFs['dfPlayers_WithSelectedAttrs'][ DFs['dfPlayers_WithSelectedAttrs']["defensive_work_rate"] == "low"]
x = x.groupby(["player_api_id","player_name","preferred_foot"])["date"].count().reset_index()


plt.subplot(122)
x["preferred_foot"].value_counts().plot.pie(autopct = "%1.0f%%", shadow = True,
                                            wedgeprops={"linewidth":2,"edgecolor":"white"},
                                            colors=["grey","orange"],explode=[0,.1],startangle=45)

plt.ylabel("")
plt.title("Preferred foot by players with High defensive rate")

plt.show()
In [41]:
idx  = "player_api_id"
cols = ['overall_rating','potential', 'crossing', 'finishing', 'heading_accuracy',
       'short_passing', 'volleys', 'dribbling', 'curve', 'free_kick_accuracy',
       'long_passing', 'ball_control', 'acceleration', 'sprint_speed',
       'agility', 'reactions', 'balance', 'shot_power', 'jumping', 'stamina',
       'strength', 'long_shots', 'aggression', 'interceptions', 'positioning',
       'vision', 'penalties', 'marking', 'standing_tackle', 'sliding_tackle']

def player_comparator(player1, player2):
    
  x1 = DFs['dfPlayers_WithSelectedAttrs'][ DFs['dfPlayers_WithSelectedAttrs']["player_name"] == player1]
  x1 = x1.groupby(["player_name"])[cols].mean()
    
  x2 = DFs['dfPlayers_WithSelectedAttrs'][ DFs['dfPlayers_WithSelectedAttrs']["player_name"] == player2]
  x2 = x2.groupby(["player_name"])[cols].mean()
    
  z  = pd.concat([x1,x2]).transpose().reset_index()
  z  = z.rename( columns={"index":"attributes", player1:player1, player2:player2})
  z.index = z.attributes
  z[[player1,player2]].plot(kind = "barh",
                            figsize = (8,12),
                            colors = ["orange","grey"],
                            linewidth = 1 ,
                            width = .7,
                            edgecolor = ["k"]*z["attributes"].nunique()
                            )
  plt.xlabel("mean value")
  plt.legend(loc = "best",prop = {"size" : 15})
  plt.grid(True, alpha = .3)
  plt.title("{} vs {}".format(player1, player2))
  ## plt.title(player1+"  vs  "+player2)

# MESSI VS RONALDO
player_comparator('Lionel Messi','Cristiano Ronaldo')

## 'Ronaldinho' & 'Wayne Rooney'
player_comparator( 'Ronaldinho','Wayne Rooney')

## 'Zlatan Ibrahimovic' & 'Cristiano Ronaldo'
player_comparator('Zlatan Ibrahimovic','Cristiano Ronaldo')

Goal Keeper

In [42]:
## goal_keeper == DFs['dfGoalKeeper']
DFs['dfGoalKeeper'] = DFs['dfPlayers_WithSelectedAttrs'][["player_api_id",'gk_diving', 'gk_handling', 'gk_kicking',
                                                          'gk_positioning','gk_reflexes', 'player_name',
                                                          "overall_rating"]]

DFs['dfGoalKeeper'] = DFs['dfGoalKeeper'][(DFs['dfGoalKeeper']["gk_diving"] > 75) &
                                          (DFs['dfGoalKeeper']["gk_handling"] > 75) &
                                          (DFs['dfGoalKeeper']["gk_kicking"] > 75) &
                                          (DFs['dfGoalKeeper']["gk_positioning"] > 75) &
                                          (DFs['dfGoalKeeper']["gk_reflexes"] > 75) ]

DFs['dfGoalKeeper'] = DFs['dfGoalKeeper'].groupby(
  ["player_api_id", "player_name"])[['gk_diving', 'gk_handling', 'gk_kicking',
                                     'gk_positioning', 'gk_reflexes', 'overall_rating']].mean()

DFs['dfGoalKeeper'] = DFs['dfGoalKeeper'].sort_values(by="overall_rating", ascending =False).reset_index()
DFs['dfGoalKeeper'].index = DFs['dfGoalKeeper']["player_name"]
DFs['dfGoalKeeper']


### Top Goal Keepers Stats
DFs['dfGoalKeeper'][['gk_diving', 'gk_handling', 'gk_kicking',
       'gk_positioning', 'gk_reflexes']][:8].plot(kind = "bar", figsize=(15,5),
                                                  color =["r","b","grey","lime","yellow"],
                                                  linewidth = 1, edgecolor = "k"*10 )

plt.xticks(rotation =0)
plt.legend(loc ="lower center")
plt.ylabel("mean value")
plt.title("# Top Goal Keepers Stats")
plt.show()

DISTRIBUTION OF TEAM ATTRIBUTES AMONG TEAMS

In [43]:
### Merge Teams with Team Attributes
DFs["dfTeams"] =  DFs["dfTeam_Attributes"].merge(DFs["dfTeam"],
                                                 left_on="team_api_id",
                                                 right_on="team_api_id",
                                                 how="left")

DFs["dfTeams_Refined"] = DFs["dfTeams"].drop(['id_x','id_y', 'team_fifa_api_id_y'], axis=1)
DFs["dfTeams_Refined"]["date"] = pd.to_datetime(DFs["dfTeams_Refined"]["date"], format="%Y-%m-%d")

columns = DFs["dfTeams_Refined"].columns

cat_col = columns[ columns.str.contains("Class") ].tolist()
num_col= [x for x in DFs["dfTeams_Refined"].columns if x not in columns[columns.str.contains("Class")].tolist() + 
          ["team_api_id"] + ['team_fifa_api_id_x'] + ["date"] + ['team_long_name'] + ['team_short_name']]

TeamMark_l = ["team_api_id"]+['team_fifa_api_id_x']+["date"]+['team_long_name']+[ 'team_short_name']

categorical_team_info = DFs["dfTeams_Refined"][cat_col + TeamMark_l]
numerical_team_info   = DFs["dfTeams_Refined"][num_col + TeamMark_l]

numerical_team_info


# from scipy.stats import mode
c = categorical_team_info.groupby("team_long_name").agg({"buildUpPlaySpeedClass":lambda x:mode(x)[0],
                                                    "buildUpPlayDribblingClass":lambda x:mode(x)[0],
                                                    'buildUpPlayPassingClass':lambda x:mode(x)[0],
                                                    'buildUpPlayPositioningClass':lambda x:mode(x)[0],
                                                    'chanceCreationPassingClass':lambda x:mode(x)[0],
                                                    'chanceCreationCrossingClass':lambda x:mode(x)[0],
                                                     'chanceCreationShootingClass':lambda x:mode(x)[0],
                                                     'chanceCreationPositioningClass':lambda x:mode(x)[0],
                                                     'defencePressureClass':lambda x:mode(x)[0],
                                                     'defenceAggressionClass':lambda x:mode(x)[0],
                                                     'defenceTeamWidthClass':lambda x:mode(x)[0],
                                                     'defenceDefenderLineClass':lambda x:mode(x)[0]}).reset_index()

cat_col
plt.figure(figsize=(13,19))
for i,j in itertools.zip_longest(cat_col,range(len(cat_col))):
    plt.subplot(4,3,j+1)
    plt.pie(c[i].value_counts().values,labels=c[i].value_counts().keys(),
            wedgeprops={"linewidth":3,"edgecolor":"w"},
           colors=sns.color_palette("Pastel1"), autopct = "%1.0f%%")
    my_circ = plt.Circle((0,0),.7,color="white")
    plt.gca().add_artist(my_circ)
    plt.title(i)
    plt.xlabel("")

Top Teams by Team Attributes

This section visualized the ranks of teams in each numerical team attributes. Same technique can apply to play attributes.

In [44]:
### Top Teams by Team Attributes

n = numerical_team_info.groupby("team_long_name")[num_col].mean().reset_index()
cols = [x for x in n.columns if x not in ["team_long_name"]]

length = len(cols)

plt.figure(figsize=(13,13))

for i,j in itertools.zip_longest(cols, range(length)):
  plt.subplot(length/3, length/3, j+1)
  ax = sns.barplot(i, "team_long_name", data=n. sort_values(by=i, ascending=False)[:7], palette="winter")
  plt.title(i)
  plt.subplots_adjust(wspace = .6, hspace =.3)
  plt.ylabel("")
    
  for i,j in enumerate(round(n.sort_values(by = i,ascending=False)[i][:7],2)):
    ax.text(.7,i,j,weight = "bold",color="white") 

Team Comparator

Build a comparator to compare the performance and capabilities between two team. Nine categories were chosen as the key measurement. Radar charts create an intuitive graph to present the strengths and weaknesses between two teams.

In [45]:
### TEAM COMPARATOR
def team_comparator(team1,team2):
  team_list = [team1, team2]
  length    = len(team_list)
  cr        = ["b","r"]
  fig = plt.figure(figsize=(10,8))
  plt.subplot(111,projection= "polar")

  for i,j,k in itertools.zip_longest(team_list,range(length),cr):
    cats = num_col
    N    = len(cats)
        
    values = n[n["team_long_name"] ==  i][cats].values.flatten().tolist()
    values += values[:1]
        
    angles = [n/float(N)*2*pi for n in range(N)]
    angles += angles[:1]
        
    plt.xticks(angles[:-1],cats,color="k",fontsize=15)
    plt.plot(angles,values,linewidth=3,color=k)
    plt.fill(angles,values,color = k,alpha=.4,label = i)
    plt.legend(loc="upper right",frameon =True,prop={"size":15}).get_frame().set_facecolor("lightgrey")
    fig.set_facecolor("w")
    fig.set_edgecolor("k")
    plt.title("Team Comparator", fontsize=20, color="blue")

### Real Madrid CF vs. FC Barcelona:
team_comparator("Real Madrid CF","FC Barcelona")

### Manchester United vs Liverpool:
team_comparator("Manchester United","Liverpool")

About Matches - Merge Matches with Leagues and Countries

In [46]:
## Merge Country and leauge DataFrames

DFs["dfCountryLeague"] = DFs['dfCountry'].merge(DFs['dfLeague'], left_on="id", right_on="id", how="outer")
DFs["dfCountryLeague"] = DFs["dfCountryLeague"].drop("id", axis=1)
DFs["dfCountryLeague"] = DFs["dfCountryLeague"].rename(columns={'name_x':"country", 'name_y':"league"})

print(DFs["dfCountry"].head())
print(DFs["dfLeague"].head())
DFs["dfCountryLeague"].head()
      id     name
0      1  Belgium
1   1729  England
2   4769   France
3   7809  Germany
4  10257    Italy
      id  country_id                    name
0      1           1  Belgium Jupiler League
1   1729        1729  England Premier League
2   4769        4769          France Ligue 1
3   7809        7809   Germany 1. Bundesliga
4  10257       10257           Italy Serie A
Out[46]:
country country_id league
0 Belgium 1 Belgium Jupiler League
1 England 1729 England Premier League
2 France 4769 France Ligue 1
3 Germany 7809 Germany 1. Bundesliga
4 Italy 10257 Italy Serie A
In [47]:
## Select attributes to form a new Match dataframe
DFs['dfMatch_New'] = DFs['dfMatch'][['id', 'country_id', 'league_id', 'season', 'stage', 'date',
                   'match_api_id', 'home_team_api_id', 'away_team_api_id', 'home_team_goal', 'away_team_goal']]

DFs['dfMatch_New'] = DFs['dfMatch_New'].drop("id", axis=1)

print("Dimension of dfMatch: {} vs dfMatch_New: {}".format(DFs['dfMatch'].shape, DFs['dfMatch_New'].shape))
## DFs['dfMatch_New'].head()
Dimension of dfMatch: (25979, 115) vs dfMatch_New: (25979, 10)
In [48]:
## Merge leauge DataFrame with Match
## data = matches_new.merge(countries_leagues,left_on="country_id",right_on="country_id",how="outer")
dfMatchCountryLeague = DFs["dfMatch_New"].merge(DFs["dfCountryLeague"],
                                                left_on="country_id",
                                                right_on="country_id", how="outer")

## check null values
dfMatchCountryLeague.isna().sum()
Out[48]:
country_id          0
league_id           0
season              0
stage               0
date                0
match_api_id        0
home_team_api_id    0
away_team_api_id    0
home_team_goal      0
away_team_goal      0
country             0
league              0
dtype: int64
In [49]:
## Unique values in dfMatchCountryLeague DataFrame
dfMatchCountryLeague.nunique()
Out[49]:
country_id             11
league_id              11
season                  8
stage                  38
date                 1694
match_api_id        25979
home_team_api_id      299
away_team_api_id      299
home_team_goal         11
away_team_goal         10
country                11
league                 11
dtype: int64

Geographic Location Visualization

This section I experiecned the basemap.Basemap module to get a visual on where these matches were hold around the globe.

In [50]:
## Import Latitude and Longitude info:
DFs['dfLatLong'] = pd.read_excel("input/latlongSoccer.xlsx", sheetname = "Sheet1")

## Blend in the Latitude / Longitude info to mark on geographic globe map:
## Ref. https://www.kaggle.com/pavanraj159/european-football-data-analysis

DFs['dfGeoCountry'] = DFs['dfCountryLeague'].merge(DFs['dfLatLong'], left_on="country", right_on="name", how="left")
DFs['dfGeoCountry'] = DFs['dfGeoCountry'].drop(["country_id", "country_y", "name"], axis = 1)

baseMap = Basemap(projection='ortho', resolution=None, lat_0=50, lon_0=10, urcrnrlat=80, llcrnrlat=-80)

plt.figure(figsize=(12,12))

country = list(DFs['dfGeoCountry']["country_x"].unique())
cplt    = sns.color_palette("Set1", 11)
label   = country

def drawMap(country, cplt, label):
  '''Draw map by marking a Plus sign'''
  lat = list(DFs['dfGeoCountry'][DFs['dfGeoCountry']["country_x"] == country].latitude)
  lon = list(DFs['dfGeoCountry'][DFs['dfGeoCountry']["country_x"] == country].longitude)
  x, y = baseMap(lon, lat)
  baseMap.plot(x, y, "P", markersize=7, color=j, alpha=.8, label=i)

for i,j in zip(country, cplt):
  drawMap(i, j, i)

baseMap.bluemarble(scale=0.5)

plt.legend(loc="center right", frameon=True, prop={"size":15}).get_frame().set_facecolor("white")
plt.title("Matches In Countries")
plt.show()
Clipping input data to the valid range for imshow with RGB data ([0..1] for floats or [0..255] for integers).

Where Were the Leagues Located in Europe?

In [51]:
baseMap01 = Basemap(projection="merc", llcrnrlat=35, urcrnrlat=60,
                    llcrnrlon=-12, urcrnrlon=25, lat_ts=20, lat_0=True, lon_0=True)

plt.figure(figsize=(15,10))

baseMap01.drawmapboundary(fill_color="skyblue",color="k",linewidth=2)
baseMap01.drawcoastlines(linewidth=2)
baseMap01.drawcountries(linewidth=2, color="grey")
baseMap01.fillcontinents(color="gold", alpha=1, lake_color="b")

lea = list(DFs['dfGeoCountry']["league"].unique())
cplt = sns.color_palette("deep",11)
lat = list(DFs['dfGeoCountry'][DFs['dfGeoCountry']["league"] == lea].latitude)
lon = list(DFs['dfGeoCountry'][DFs['dfGeoCountry']["league"] == lea].longitude)
x, y  = baseMap01(lon,lat) 

def drawMapLeagueCountry(lea, cplt):
  lat = list(DFs['dfGeoCountry'][DFs['dfGeoCountry']["league"] == lea].latitude)
  lon = list(DFs['dfGeoCountry'][DFs['dfGeoCountry']["league"] == lea].longitude)
  x,y = baseMap01(lon, lat)
  baseMap01.plot(x, y, "*", markersize=15, color=j, alpha=.8, linewidth=20)
  ## baseMap01.plot(x, y, "go", markersize=10, color=j, alpha=.8, linewidth=20)
    
for i,j in zip(lea, cplt):
  drawMapLeagueCountry(i,j)
        
for i,j,k in itertools.zip_longest(x, y, DFs['dfGeoCountry']["league"]):
  plt.text(i, j, k, fontsize =10, color="k", horizontalalignment='center', verticalalignment='center', weight="bold")

plt.title("League Locations In Different Countries")
plt.show()

Show the number of Matches played in each League

In [52]:
plt.figure(figsize=(8,8))

ax = sns.countplot(y = dfMatchCountryLeague["league"],
                   order = dfMatchCountryLeague["league"].value_counts().index,
                   linewidth = 1,
                   edgecolor = "k" * dfMatchCountryLeague["league"].nunique()
                 )

for i,j in enumerate(dfMatchCountryLeague["league"].value_counts().values):
  ax.text(.7, i, j, weight = "bold")

plt.title("Number of Matches by Leagues")
plt.show()
In [53]:
## Converting date format in the Matches by Countires and Leagues DataFrame 
dfMatchCountryLeague["date"] = pd.to_datetime(dfMatchCountryLeague["date"], format="%Y-%m-%d")

## Extracting year
dfMatchCountryLeague["year"] = pd.DatetimeIndex(dfMatchCountryLeague["date"]).year

Matches played in each League by Season

In [54]:
plt.figure(figsize=(14, 10))

sns.countplot(y = dfMatchCountryLeague["season"], hue=dfMatchCountryLeague["league"],
              palette=["r","g","b","c","lime","m","y","k","gold","orange"])

plt.title("Matches Played In Each League By Season")
plt.show()

Does Home-Field Advantage Matter?

To find out I need to blend in the Team dataframe with the previous Match-Country-League one.

The dfMatch dataframe contains goals and match results. I want to see if "home advantage" exists by comparing average goals when teams were in their home fields vs when they were visiting.

We can see the average goal when teams were at Home is higher than the average when teams were visiting. The KDE chart does reveal, in general, "home advantage" exists in terms of goals.

In [55]:
## Merge Home Team data:
dfMatchCountryLeague = dfMatchCountryLeague.merge(DFs['dfTeam'],
                                                  left_on="home_team_api_id",
                                                  right_on="team_api_id",
                                                  how="left")

dfMatchCountryLeague = dfMatchCountryLeague.drop(["id", "team_api_id", 'team_fifa_api_id'], axis = 1)

dfMatchCountryLeague = dfMatchCountryLeague.rename(columns={'team_long_name':"home_team_lname",
                                                                          'team_short_name':"home_team_sname"})

## Meerge Away Team data:
dfMatchCountryLeague = dfMatchCountryLeague.merge(DFs['dfTeam'],
                                                  left_on="away_team_api_id",
                                                  right_on="team_api_id",
                                                  how="left")

dfMatchCountryLeague = dfMatchCountryLeague.drop(["id","team_api_id",'team_fifa_api_id'],axis = 1)

dfMatchCountryLeague = dfMatchCountryLeague.rename(columns={ 'team_long_name':"away_team_lname",
                                                                          'team_short_name':"away_team_sname"})

### Distribution of Home and Away Goals
plt.figure(figsize=(12, 6))

sns.kdeplot(dfMatchCountryLeague["home_team_goal"], shade=True, color="b", label="Home Goals")
sns.kdeplot(dfMatchCountryLeague["away_team_goal"], shade=True, color="r", label="Away Goals")

plt.axvline(dfMatchCountryLeague["home_team_goal"].mean(), linestyle="dashed", color="b", label="Home Goals Avg")
plt.axvline(dfMatchCountryLeague["away_team_goal"].mean(), linestyle="dashed", color="r", label="Away Goals Avg")

plt.legend(loc="best", prop = {"size" : 12})
plt.title("Distribution of Home and Away Goals")
plt.xlabel("Goals")
plt.show()

Home and Away goals in each League

In [56]:
## Derive a Total_Goal feature: 
dfMatchCountryLeague["total_goal"] = dfMatchCountryLeague["home_team_goal"] + dfMatchCountryLeague["away_team_goal"]

## Gather goals by playing at home field vs in visiting:
dfMatchCountryLeague.groupby("league").agg({"home_team_goal":"sum", "away_team_goal":"sum"}).plot(kind="barh",
                                                                                 figsize = (10,10),
                                                                                 edgecolor = "k",
                                                                                 linewidth =1)
plt.title("Home and Away goals by Leagues")
plt.legend( loc= "best" , prop= {"size" : 14})
plt.xlabel("Total Goals")
plt.show()

Top 20 Winning Teams and Goals, At Home vs. Away Visiting

This chart shows top 20 highest goals scored when teams played at home vs away from home. It depicts a fact that, home or not, top team performance remained balanced and competitive. Teams, such as Real Madrid FC, FC Barcelona, and Celtic, etc., took over the top 3 scores in both Home Goals and Away one.

In [57]:
DFs['dfHomeTeam'] = dfMatchCountryLeague.groupby("home_team_lname")["home_team_goal"].sum().reset_index()
DFs['dfAwayTeam'] = dfMatchCountryLeague.groupby("away_team_lname")["away_team_goal"].sum().reset_index()

DFs['dfHomeTeam'] = DFs['dfHomeTeam'].sort_values(by="home_team_goal",ascending= False)
DFs['dfAwayTeam'] = DFs['dfAwayTeam'].sort_values(by="away_team_goal",ascending= False)

plt.figure(figsize=(13,10))
plt.subplot(121)

ax = sns.barplot(y="home_team_lname", x="home_team_goal", data=DFs['dfHomeTeam'][:20], palette="summer",
                 linewidth = 1, edgecolor = "k"*20)

plt.ylabel('')
plt.title("Top Teams by Home Goals")

for i,j in enumerate(DFs['dfHomeTeam']["home_team_goal"][:20]):
  ax.text(.7, i, j, weight = "bold")


plt.subplot(122)
ax = sns.barplot(y="away_team_lname", x="away_team_goal", data=DFs['dfAwayTeam'][:20],
                 palette="winter", linewidth = 1, edgecolor = "k"*20)

plt.ylabel("")
plt.subplots_adjust(wspace = .4)
plt.title("Top Teams by Visiting Goals")

for i,j in enumerate(DFs['dfAwayTeam']["away_team_goal"][:20]):
  ax.text(.7, i, j, weight = "bold")

Proportion of Game Outcomes

This Pie chart shows the proportion of the parties. As we can see teams at home won 46% of the games compared with 29% and 25% of game ties in the past 8 years from 2008 to 2016.

Since teams at home claimed more victories than those in visit the pie chart suggested that "Home Advantage" exist.

In [58]:
## New feature: Won; a new function to return the Winning team name
def lblWonTeam(df):
  '''Return the name of the Winning team'''
  if df["home_team_goal"] > df["away_team_goal"]:
    return df["home_team_lname"]
  elif df["away_team_goal"] > df["home_team_goal"]:
    return df["away_team_lname"]
  elif df["home_team_goal"] == df["away_team_goal"]:
    return "DRAW"

dfMatchCountryLeague["Won"] = dfMatchCountryLeague.apply(lambda df: lblWonTeam(df), axis=1)

## New feature: Lost; a new function to return the team which lost in a game
def lblLostTeam(df):
  '''Return the name of the Lost team'''
  if df["home_team_goal"] < df["away_team_goal"]:
    return df["home_team_lname"]
  elif df["away_team_goal"] < df["home_team_goal"]:
    return df["away_team_lname"]
  elif df["home_team_goal"] == df["away_team_goal"]:
    return "DRAW"

dfMatchCountryLeague["Lost"] = dfMatchCountryLeague.apply(lambda df: lblLostTeam(df), axis=1)

## New feature: Outcome; a new function to return whether the Winning team is at Home field or Visiting
def lbl(df):
  '''Return a label for either Home Team or Visiting Team Won'''
  if df["home_team_goal"] > df["away_team_goal"]:
    return "Home Team Won"
  elif df["away_team_goal"] > df["home_team_goal"]:
    return "Visiting Team Won"
  elif df["home_team_goal"] == df["away_team_goal"]:
    return "DRAW"

dfMatchCountryLeague["Outcome"] = dfMatchCountryLeague.apply(lambda df: lbl(df), axis = 1)

## Depict the Proportion of Game Outcomes with labels 
plt.figure(figsize=(8,8))
dfMatchCountryLeague["Outcome"].value_counts().plot.pie(autopct = "%1.0f%%",
                                             colors = sns.color_palette("rainbow", 3),
                                             wedgeprops = {"linewidth":2, "edgecolor":"white"})

my_circ = plt.Circle((0,0),.7,color = "white")
plt.gca().add_artist(my_circ)
plt.title("Proportion of Game Outcomes")
plt.show()

Teams with Top 30 Highest Total Goals

In [59]:
DFs['dfHomeTeam'] = DFs['dfHomeTeam'].rename(columns={'home_team_lname':"team", 'home_team_goal':"goals"})
DFs['dfAwayTeam'] = DFs['dfAwayTeam'].rename(columns={'away_team_lname':"team", 'away_team_goal':"goals"})

DFs['dfGoals'] = pd.concat([DFs['dfHomeTeam'], DFs['dfAwayTeam']])
DFs['dfGoals'] = DFs['dfGoals'].groupby("team")["goals"].sum().reset_index().sort_values(by="goals", ascending= False)

plt.figure(figsize=(9, 14))
ax = sns.barplot(x="goals", y="team", data=DFs['dfGoals'][:30], palette="rainbow", linewidth= 1, edgecolor= "k"*30)

for i,j in enumerate(DFs['dfGoals']["goals"][:30]):
    ax.text(.3, i, j, weight="bold", color = "k", fontsize =12)

plt.title("Teams with highest total goals ")
plt.show()

Teams Played the Most Matches

In [60]:
dfMatchPlayed_HomeTeam = dfMatchCountryLeague.groupby("home_team_lname")["match_api_id"].count().reset_index()
dfMatchPlayed_HomeTeam = dfMatchPlayed_HomeTeam.rename(columns={"home_team_lname":"team"})

dfMatchPlayed_AwayTeam = dfMatchCountryLeague.groupby("away_team_lname")["match_api_id"].count().reset_index()
dfMatchPlayed_AwayTeam = dfMatchPlayed_AwayTeam.rename(columns={"away_team_lname":"team"})

dfMatchPlayed_Team = pd.concat([dfMatchPlayed_HomeTeam, dfMatchPlayed_AwayTeam], axis=0)

dfMatchPlayed_Team = dfMatchPlayed_Team.groupby("team")["match_api_id"].sum().reset_index().sort_values(
                                                                        by="match_api_id", ascending =False)

dfMatchPlayed_Team = dfMatchPlayed_Team.rename(columns={"match_api_id":"matches_played"})

## print("Number of Matches played: {}".format(dfMatchPlayed_Team["matches_played"].max()) 
## Max number of matches: 304
dfMatchPlayed_Team[ dfMatchPlayed_Team["matches_played"] == dfMatchPlayed_Team["matches_played"].max() ]
Out[60]:
team matches_played
51 Chelsea 304
22 Athletic Club de Bilbao 304
186 Paris Saint-Germain 304
256 Sunderland 304
180 Olympique de Marseille 304
179 Olympique Lyonnais 304
255 Stoke City 304
83 FC Lorient 304
273 Valencia CF 304
168 Málaga CF 304
167 Motherwell 304
75 FC Barcelona 304
162 Manchester United 304
49 Celtic 304
161 Manchester City 304
262 Tottenham Hotspur 304
158 Liverpool 304
70 Everton 304
252 Stade Rennais FC 304
144 Kilmarnock 304
146 LOSC Lille 304
23 Atlético Madrid 304
177 OGC Nice 304
61 Dundee United 304
202 RCD Espanyol 304
211 Real Madrid CF 304
13 Aberdeen 304
112 Getafe CF 304
240 Sevilla FC 304
114 Girondins de Bordeaux 304
263 Toulouse FC 304
19 Arsenal 304
20 Aston Villa 304
11 AS Saint-Étienne 304

Network Layout For Matches Played among Teams with Top Scores

In [61]:
## Selecting Top 50 teams with highest goals

## lstTopScores = list(DFs['dfGoals']["team"][:50])
lstTopScores = list(DFs['dfGoals']["team"][:20])

DFs['dfTopScores'] = dfMatchCountryLeague[["home_team_lname","away_team_lname"]]
DFs['dfTopScores'] = DFs['dfTopScores'][(DFs['dfTopScores']["home_team_lname"].isin(lstTopScores)) &
                                        (DFs['dfTopScores']["away_team_lname"].isin(lstTopScores))]

## import networkx as nx

grf = nx.from_pandas_edgelist(DFs['dfTopScores'], "home_team_lname", "away_team_lname")

fig = plt.figure(figsize=(5, 5))
nx.draw_kamada_kawai(grf, with_labels=True, node_size =2500, node_color ="Orangered", alpha=.8)

plt.title("Network Layout For Matches Played among Teams with Top Scores")
fig.set_facecolor("white")

Interaction Among Teams

For the sake of completeness I created a network map to depict all matching teams. Although it may not look very useful the mesh map did illustrate the relationships among teams.

In [62]:
grf = nx.from_pandas_edgelist(dfMatchCountryLeague, "home_team_sname", "away_team_sname")

fig = plt.figure(figsize=(11,11))
nx.draw_kamada_kawai(grf, with_labels = True)
plt.title("Interaction Among Teams")
fig.set_facecolor("white")

Matches vs Goals By Leagues

This scatter plot shows the number of matches vs. the total goals in all (11) leagues. As expected the total goals are correlated with the number of matches played in leagues. More matches played higher the total goals.

In [63]:
## Use groupby() and agg() to retrieve Goals and Matches to a new dataframe:
DFs['dfMatchesGoals_HomeTeam'] = dfMatchCountryLeague.groupby(
  ["home_team_lname", "league"]).agg({"match_api_id":"count", "home_team_goal":"sum"}).reset_index()

DFs['dfMatchesGoals_AwayTeam'] = dfMatchCountryLeague.groupby(
  ["away_team_lname", "league"]).agg({"match_api_id":"count", "away_team_goal":"sum"}).reset_index()

## Reanme key columns for concatenation: 
DFs['dfMatchesGoals_HomeTeam'] = DFs['dfMatchesGoals_HomeTeam'].rename(columns={'home_team_lname':"team",
                                                                                'match_api_id':"matches",
                                                                                'home_team_goal':"goals"})

DFs['dfMatchesGoals_AwayTeam'] = DFs['dfMatchesGoals_AwayTeam'].rename(columns={'away_team_lname':"team",
                                                                                'match_api_id':"matches",
                                                                                'away_team_goal':"goals"})

## Concatenate both dataframes
DFs['dfMatchesGoals_HomeAwayCombined'] = pd.concat([DFs['dfMatchesGoals_HomeTeam'], DFs['dfMatchesGoals_AwayTeam']])

## Groupby Team and League:
DFs['dfMatchesGoals_HomeAwayCombined'] = DFs['dfMatchesGoals_HomeAwayCombined'].groupby(["team", "league"]
                                                                                       )[["matches", "goals"]
                                                                                        ].sum().reset_index()

## Ranking based on Goals:
DFs['dfMatchesGoals_HomeAwayCombined']= DFs['dfMatchesGoals_HomeAwayCombined'].sort_values(by="goals",ascending=False)

## Visualize result:
plt.figure(figsize=(14, 9))
cplt   = ["r","g","b","m","y","yellow","c","orange","grey","lime","white"]
arrTopScoreLeagues = DFs['dfMatchesGoals_HomeAwayCombined']["league"].unique()  ## numpy.ndarray

for i, j, k in itertools.zip_longest(arrTopScoreLeagues, range(len(arrTopScoreLeagues)), cplt):
  
  plt.scatter("matches", "goals",
              data=DFs['dfMatchesGoals_HomeAwayCombined'][ DFs['dfMatchesGoals_HomeAwayCombined']["league"] == i],
              label=[i], s=100, alpha=1, linewidths=1, edgecolors="k", color=k)
  
  plt.legend(loc="best")
  plt.xlabel("Matches")
  plt.ylabel("Goals Scored")

plt.title("Matches vs. Goals By Teams")
plt.show()

Matches vs Goals By Top 50 Teams

In [64]:
plt.figure(figsize=(10, 12))

cplt = sns.color_palette("deep", 50)

plt.scatter(y = DFs['dfMatchesGoals_HomeAwayCombined']["team"][:50], 
            x = DFs['dfMatchesGoals_HomeAwayCombined']["matches"][:50],
            s = DFs['dfMatchesGoals_HomeAwayCombined']["goals"],
            alpha=.7, c=cplt, linewidths=1, edgecolors="b")

plt.xticks(rotation = 90)
plt.xlabel("Matchess Played")
plt.title("Matches vs. Goals By Top 50 Teams")
plt.show()

Home Goals Scored By Teams at Home By Season

In [65]:
### Home Goals Scored By Teams at Home By Season

wonSeason = dfMatchCountryLeague["Won"].value_counts()[1:25].index

pvHomeGoals = pd.pivot_table(index="home_team_lname", columns="season", values="home_team_goal",
                  data=dfMatchCountryLeague, aggfunc="sum")

pvHomeGoals = pvHomeGoals[pvHomeGoals.index.isin(wonSeason)]

pvHomeGoals.plot(kind="barh", stacked=True, figsize=(13, 10), colors=sns.color_palette("prism",11))
plt.title("Home Goals Scored by Teams At Home by Season")

plt.show()

Away Goals Scored When Teams Traveled By Season

In [66]:
### Away Goals Scored When Teams Traveled By Season
## wonSeason = dfMatchCountryLeague["Won"].value_counts()[1:25].index

pvAwayGoals = pd.pivot_table(index="away_team_lname", columns="season",
                  values="away_team_goal", data=dfMatchCountryLeague, aggfunc="sum")

pvAwayGoals = pvAwayGoals[pvAwayGoals.index.isin(wonSeason)]

pvAwayGoals.plot(kind="barh",stacked=True, figsize=(13,10),colors=sns.color_palette("prism", 11))
plt.title("Away Goals Scored When Teams Traveled By Season")
plt.show()

Proportion Of Matches Played And Goals Scored In Leagues

In [67]:
## Derive Total_Goal feature 
dfMatchCountryLeague["total_goal"] = dfMatchCountryLeague["home_team_goal"] + dfMatchCountryLeague["away_team_goal"]

plt.figure(figsize=(7,15))

plt.subplot(211)
agg = dfMatchCountryLeague.groupby(["league"]).agg({"match_api_id":"count", "total_goal":"sum"})

agg["match_api_id"].plot.pie(colors=sns.color_palette("seismic",10), autopct="%1.0f%%",
                             wedgeprops={"linewidth":2, "edgecolor":"white"})

plt.ylabel("Matches")
my_circ = plt.Circle((0,0), .7, color ="white")
plt.gca().add_artist(my_circ)
plt.title("Proportion Of Matches Played in Leagues")


plt.subplot(212)
agg["total_goal"].plot.pie(colors=sns.color_palette("seismic", 10), autopct="%1.0f%%",
                           wedgeprops={"linewidth":2, "edgecolor":"white"})
plt.ylabel("Goals")
my_circ = plt.Circle((0,0), .7, color ="white")
plt.gca().add_artist(my_circ)
plt.title("Proportion of Goals Scored in Leagues")
plt.show()

Top Winning & Losing Teams

In [68]:
dfWon = dfMatchCountryLeague["Won"].value_counts()[1:].reset_index()
dfLost = dfMatchCountryLeague["Lost"].value_counts()[1:].reset_index()

plt.figure(figsize=(13, 14))

plt.subplot(121)
ax = sns.barplot( dfWon["Won"][:30], dfWon["index"][:30], palette="Set2", linewidth = 1, edgecolor = "k"*30)

plt.title(" TOP WINNING TEAMS")
plt.ylabel("")

for i,j in enumerate(dfWon["Won"][:30]):
  ax.text(.7, i, j, color = "black", weight = "bold")


plt.subplot(122)
ax = sns.barplot(dfLost["Lost"][:30], dfLost["index"][:30], palette="Set2", linewidth = 1, edgecolor = "k"*30)

plt.title(" TOP TEAMS that Lost")
plt.subplots_adjust(wspace = .3)
plt.ylabel("")

for i,j in enumerate(dfLost ["Lost"][:30]):
  ax.text(.7, i, j, color = "black", weight = "bold")

Performance By Top Teams

In [69]:
### Merge Won, Draw and Lost data of team to matches played

f = DFs['dfMatchesGoals_HomeAwayCombined'].merge(dfWon, left_on="team", right_on="index", how="left")
f = f.drop("index", axis =1)
f = f.rename(columns={"Outcome":"wins"})

f = f.merge(dfLost, left_on="team", right_on="index", how="left")
f = f.drop("index", axis =1)
dr = dfMatchCountryLeague[ dfMatchCountryLeague["Outcome"] == "DRAW"][["home_team_lname", "away_team_lname"]]

l  = dr["home_team_lname"].value_counts().reset_index()
v  = dr["away_team_lname"].value_counts().reset_index()

l  = l.rename(columns={'index':"team", 'home_team_lname':"draw"})
v  = v.rename(columns={'index':"team", 'away_team_lname':"draw"})

lv = pd.concat([l,v])
lv = lv.groupby("team")["draw"].sum().reset_index()

f = f.merge(lv, left_on="team", right_on="team", how ="left")

## PERFORMANCE BY TOP TEAMS

f = f.sort_values(by="goals",ascending=False)

f_new = f.copy()
f_new.index = f_new.team
f_new[["Won","Lost","draw"]][:20].plot(kind = "bar", figsize=(13,5), stacked =True, linewidth = 1, edgecolor = "k"*20)

plt.legend(loc="best", prop = {"size" : 13})
plt.title("Performance By Top Teams")
plt.ylabel("Matches Played")
plt.show()

Won vs Lost vs Draw

In [70]:
## from mpl_toolkits.mplot3d import Axes3D

fig = plt.figure(figsize=(10,10))
ax  = fig.add_subplot(111, projection ="3d")

cplt = sns.color_palette("deep", 296)
ax.scatter(f["Won"], f["Lost"], f["draw"], s=f["matches"]*3, alpha=.4, linewidth =1, edgecolor= "k", c = cplt)
## ax.scatter(f["Won"], f["Lost"], f["draw"], s=f["matches"]*3, alpha=.4, linewidth =1, edgecolor= "k", c = "lime")

ax.set_xlabel("wins")
ax.set_ylabel("lost")
ax.set_zlabel("draw")

plt.title("Won VS Lost VS Draw")
plt.show()

Area Plot for Match Attributes

In [71]:
plt.figure(figsize=(13, 6))

cols = ["matches", "Won", "Lost", "draw"]
c    = ["b","orange","lime","m"]
length = len(cols)

for i,j,k in itertools.zip_longest(cols, range(length), c):
  plt.stackplot(f.index,f[i], alpha=.6, color = k, labels=[i])
  plt.axhline(f[i].mean(),color=k, linestyle="dashed", label=i+ " mean")
  plt.legend(loc="best")
  plt.title("AREA PLOT FOR MATCH ATTRIBUTES")
  plt.xlabel("team index")

Number of Matches Played In Each League vs. Total Goals Scored

In [72]:
dfMatchCountryLeague.groupby(
  ["league"]).agg({"match_api_id":"count", "total_goal":"sum"}).plot(
  kind="barh", stacked =True, figsize=(12, 6), linewidth = 1, edgecolor = "k"*dfMatchCountryLeague["league"].nunique())

plt.title("Number of Matches Played In Each League vs. Total Goals Scored")
plt.show()

Word Cloud of More-Than-100-Win Teams

Again, use a Word Cloud to visualize the frequency of words, in this case winning teams, which won more than 100 games between 2008 and 2015. I saw FC Barcelona, Real Madrid FC and Celtic stood out from the cloud.

In [73]:
## from wordcloud import WordCloud
## import nltk

wrd = dfMatchCountryLeague[dfMatchCountryLeague["Won"] != "DRAW"]["Won"].to_frame()
wrd = wrd["Won"].value_counts()[wrd["Won"].value_counts() > 100].keys().str.replace(" ","")
wrd = pd.DataFrame(wrd)

img = np.array(Image.open("input/picture-wrd/zSoccer_LcdozeKqi.jpg"))

wc = WordCloud(background_color="gray", scale=2, mask=img, colormap="prism").generate(str(wrd[0]))

plt.figure(figsize=(13, 10))
plt.imshow(wc, interpolation="bilinear")

plt.axis("off")
plt.title("Top Teams")
plt.show()

Goals Scored in each Season of Leagues

In [74]:
pvt = pd.pivot_table(index="season", columns="league", values="total_goal", data=dfMatchCountryLeague, aggfunc="sum")

pvt.plot(kind = "barh", stacked = True, figsize =(12, 5),
         colors =sns.color_palette("rainbow",11),
         linewidth = .5,edgecolor = ["grey"]*10)

plt.title("Goals Scored in each Season of Leagues")
plt.show()

Matches Played Among Top Teams

In [75]:
wonSeason50 = dfMatchCountryLeague["Won"].value_counts()[1:50].index

wonTeams = dfMatchCountryLeague[(dfMatchCountryLeague["home_team_lname"].isin(wonSeason50)) &
                                (dfMatchCountryLeague["away_team_lname"].isin(wonSeason50))]

ctwonTeams = pd.crosstab(wonTeams["home_team_lname"], wonTeams["away_team_lname"])

plt.figure(figsize=(13,10))
sns.heatmap(ctwonTeams, annot=True, cmap=sns.color_palette("inferno"))
plt.title("Count Of Matches Played Between Top Teams")
plt.show()

Championship Teams in the Past Eight Seasons By Leagues

This section illustrates the championship teams of each league in the past eight seasons. This dataset covers one league from each country so we have only 11 leagues.

I created a function to draw barplots, one for each league.

Note: Since the argument in matplotlib subplot() requires a 3-dight integer we need to use the comma format, i.e. (row, col, order), on this argument when the number of plots goes beyond 10

Next, I leveraged violin plots to view the median and probability density of total goals scored in each league.

At the end, we made a graphic illustration of a rank of the top teams across all leagues based on the number of championships they won in their leagues.

In [76]:
## Print League Names 
Leagues = getMbrNames(DFs['dfLeague'])
print("Selected European Football Leagues:\n\n{}".format(";\n".join(Leagues)))

dfWonSeasonLeague = dfMatchCountryLeague[["season", "league", "Won"]]
dfWonSeasonLeague["team"] = dfWonSeasonLeague["Won"]
dfWonSeasonLeague = dfWonSeasonLeague.groupby(
  ["season","league","team"])["Won"].count().reset_index().sort_values(by=["season","league","Won"], ascending=False)

dfWonSeasonLeague = dfWonSeasonLeague[dfWonSeasonLeague["team"] != "DRAW"]

dfWonSeasonLeague = dfWonSeasonLeague.drop_duplicates(subset=["season", "league"],
                                                      keep="first").sort_values(by=["league","season"],ascending=True)

leagues = [lg for lg in DFs['dfLeague']['name']]  ## 11 is over 9. subplot() can't seem to handle.

plt.figure(figsize=(13, 40))

def draw_Barplot_WonSeasonLeague(league, *pos):
  
  row, col, order = pos
  plt.subplot(row, col, order)

  lg = dfWonSeasonLeague[dfWonSeasonLeague["league"] == league]
  ax = sns.barplot(lg["Won"], lg["season"], palette="cool",
                  linewidth = 1, edgecolor = "k"*lg["season"].nunique())  

  for i,j in enumerate(lg["team"]):
      ax.text(.7, i, j, weight = "bold")

  plt.title(league)
  plt.xlabel("Won")
  plt.ylabel("Season")
  
row, col, init = 9, 2, 1
pos = [row, col, init]

for league in leagues:  
  draw_Barplot_WonSeasonLeague(league, *pos)
  pos[2] += 1

plt.show()
Selected European Football Leagues:

Belgium Jupiler League;
England Premier League;
France Ligue 1;
Germany 1. Bundesliga;
Italy Serie A;
Netherlands Eredivisie;
Poland Ekstraklasa;
Portugal Liga ZON Sagres;
Scotland Premier League;
Spain LIGA BBVA;
Switzerland Super League

Violin Plot for Total Goals Scored By Leagues

Similar to a box plot, a Violin plot can tell us the median and interquartile range of a numerical feature in a data set. In addition, it also shows the probability density of the data at different values which are usually smoothed by a kernel density estimator.

In [77]:
plt.figure(figsize=(14, 4))

sns.violinplot(dfMatchCountryLeague["league"], dfMatchCountryLeague["total_goal"], palette="rainbow")
plt.title("Goals Scored in Leagues")
plt.xticks(rotation = 60)
plt.show()

Top Teams By Leagues

The following barchart visualized the teams which won the most championships in history in their leagues.

In [78]:
x = pd.DataFrame(dfMatchCountryLeague.groupby(["league", "Won"])["Won"].count())
## print(x.columns)  ## Index(['Won'], dtype='object')

x = x.rename(columns={"Won":"team"}).reset_index()
## print(x.columns)  ## Index(['league', 'Won', 'team'], dtype='object')

x = x.rename(columns={"Won":"team", "team":"Won"})
## print(x.columns)  ## Index(['league', 'team', 'Won'], dtype='object')

x = x.sort_values(by="Won",ascending=False)

x = x[x["team"] != "DRAW"]
x = x.drop_duplicates(subset=["league"], keep="first")

plt.figure(figsize=(8, 5))
ax =sns.barplot(x["Won"],x["league"], palette="cool", linewidth = 1, edgecolor = "k"*10)

for i,j in enumerate(x["team"]):
  ax.text(.7,i,j,weight = "bold",fontsize = 12)

plt.title("Top Teams By Leagues")
plt.show()

Rank the Best Championship Teams Across All Leagues

This diagram gives us an idea which teams won the most championships in their leagues and also rank these top teams based on the number they achieved. Although the levels of championship competition were different among the leagues the rank still identified the teams with the most glory in history in their leagues. This rank can serve as a stepping stone for next research to find out which team is the best of the best in history.

In [79]:
plt.figure(figsize=(8, 14))

ax = sns.countplot(y=dfWonSeasonLeague["team"], order=dfWonSeasonLeague["team"].value_counts().index,
                   palette="plasma", linewidth = 1, edgecolor = "k" * dfWonSeasonLeague["team"].nunique())

plt.title("Best Championship Team in Each League")

for i,j in enumerate(dfWonSeasonLeague["team"].value_counts().values):
  ax.text(.2, i, j, color = "white", weight="bold")

Conclusions

Like baseball and basketball, Soccer is a team sport. Both players and teams are the center of focus when it comes to evaluation and prediction. Matches and leagues are built around them, followed by countries and others.

Analyzing the league, country, player, player attributes, team and team atributes tables gave a better understanding of the data. Once the respective features are joined and merged with the match table, machine learning algorithms can be used to predict the winner of the future soccer matches in the european league. This kernel also allows for the opportunity to practice using the seaborn library and visualizing the data.

Domain knowledge is essential in data analysis. In this case deep understanding of the meaning of each attributes can certainly helps on these decision:

  • Choose relavent categories when making radar charts,
  • Recognize key attributes to evaluate players,
  • select criteria to identify so-called top rated players, etc.

Fortunately, search engines are our friends. Blogs, news, tutorials can be good resouces and are only a few keystrokes away to build required knowledge.

In this project I created graphic illustration to answer questions. I created Team Comparator and Player Comparator. Some of the questions are listed as follows,

  • Does Player's "Overall Rating" depend on any Player Attribute(s)?
  • What are the similarities and differences in terms of skills among Top Overall_Rating Players?
  • Create a Word Cloud for players with Overall Rating beyond 88
  • Does the proportion of preferred foot change between all players and top players with overall rating higher than 80?, etc.

I think the work I built in this project can serve as a good fundation for further analyses.

Lock database by closing Cursor and DB Connection:

In [80]:
## Lock database by closing Cursor and DB Connection:
cursor.close()
conn.close()
In [81]:
# from subprocess import call
# call(['python', '-m', 'nbconvert', 'Investigate_a_Dataset_completed.ipynb'])